Pul xfs updates from Dave Chinner:
"There's not a huge amount of change in this cycle - Darrick has been
out of action for a couple of months (hence me sending the last few
pull requests), so we decided a quiet cycle mainly focussed on bug
fixes was a good idea. Darrick will take the helm again at the end of
this merge window.
FYI, I may be sending another update later in the cycle - there's a
pending rework of the clone/dedupe_file_range code that fixes numerous
bugs that is spread amongst the VFS, XFS and ocfs2 code. It has been
reviewed and tested, Al and I just need to work out the details of the
merge, so it may come from him rather than me.
Summary:
- only support filesystems with unwritten extents
- add definition for statfs XFS magic number
- remove unused parameters around reflink code
- more debug for dangling delalloc extents
- cancel COW extents on extent swap targets
- fix quota stats output and clean up the code
- refactor some of the attribute code in preparation for parent
pointers
- fix several buffer handling bugs"
* tag 'xfs-4.20-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (21 commits)
xfs: cancel COW blocks before swapext
xfs: clear ail delwri queued bufs on unmount of shutdown fs
xfs: use offsetof() in place of offset macros for __xfsstats
xfs: Fix xqmstats offsets in /proc/fs/xfs/xqmstat
xfs: fix use-after-free race in xfs_buf_rele
xfs: Add attibute remove and helper functions
xfs: Add attibute set and helper functions
xfs: Add helper function xfs_attr_try_sf_addname
xfs: Move fs/xfs/xfs_attr.h to fs/xfs/libxfs/xfs_attr.h
xfs: issue log message on user force shutdown
xfs: fix buffer state management in xrep_findroot_block
xfs: always assign buffer verifiers when one is provided
xfs: xrep_findroot_block should reject root blocks with siblings
xfs: add a define for statfs magic to uapi
xfs: print dangling delalloc extents
xfs: fix fork selection in xfs_find_trim_cow_extent
xfs: remove the unused trimmed argument from xfs_reflink_trim_around_shared
xfs: remove the unused shared argument to xfs_reflink_reserve_cow
xfs: handle zeroing in xfs_file_iomap_begin_delay
xfs: remove suport for filesystems without unwritten extent flag
...
Pull security subsystem updates from James Morris:
"In this patchset, there are a couple of minor updates, as well as some
reworking of the LSM initialization code from Kees Cook (these prepare
the way for ordered stackable LSMs, but are a valuable cleanup on
their own)"
* 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
LSM: Don't ignore initialization failures
LSM: Provide init debugging infrastructure
LSM: Record LSM name in struct lsm_info
LSM: Convert security_initcall() into DEFINE_LSM()
vmlinux.lds.h: Move LSM_TABLE into INIT_DATA
LSM: Convert from initcall to struct lsm_info
LSM: Remove initcall tracing
LSM: Rename .security_initcall section to .lsm_info
vmlinux.lds.h: Avoid copy/paste of security_init section
LSM: Correctly announce start of LSM initialization
security: fix LSM description location
keys: Fix the use of the C++ keyword "private" in uapi/linux/keyctl.h
seccomp: remove unnecessary unlikely()
security: tomoyo: Fix obsolete function
security/capabilities: remove check for -EINVAL
Pull siginfo updates from Eric Biederman:
"I have been slowly sorting out siginfo and this is the culmination of
that work.
The primary result is in several ways the signal infrastructure has
been made less error prone. The code has been updated so that manually
specifying SEND_SIG_FORCED is never necessary. The conversion to the
new siginfo sending functions is now complete, which makes it
difficult to send a signal without filling in the proper siginfo
fields.
At the tail end of the patchset comes the optimization of decreasing
the size of struct siginfo in the kernel from 128 bytes to about 48
bytes on 64bit. The fundamental observation that enables this is by
definition none of the known ways to use struct siginfo uses the extra
bytes.
This comes at the cost of a small user space observable difference.
For the rare case of siginfo being injected into the kernel only what
can be copied into kernel_siginfo is delivered to the destination, the
rest of the bytes are set to 0. For cases where the signal and the
si_code are known this is safe, because we know those bytes are not
used. For cases where the signal and si_code combination is unknown
the bits that won't fit into struct kernel_siginfo are tested to
verify they are zero, and the send fails if they are not.
I made an extensive search through userspace code and I could not find
anything that would break because of the above change. If it turns out
I did break something it will take just the revert of a single change
to restore kernel_siginfo to the same size as userspace siginfo.
Testing did reveal dependencies on preferring the signo passed to
sigqueueinfo over si->signo, so bit the bullet and added the
complexity necessary to handle that case.
Testing also revealed bad things can happen if a negative signal
number is passed into the system calls. Something no sane application
will do but something a malicious program or a fuzzer might do. So I
have fixed the code that performs the bounds checks to ensure negative
signal numbers are handled"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits)
signal: Guard against negative signal numbers in copy_siginfo_from_user32
signal: Guard against negative signal numbers in copy_siginfo_from_user
signal: In sigqueueinfo prefer sig not si_signo
signal: Use a smaller struct siginfo in the kernel
signal: Distinguish between kernel_siginfo and siginfo
signal: Introduce copy_siginfo_from_user and use it's return value
signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE
signal: Fail sigqueueinfo if si_signo != sig
signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
signal/unicore32: Use force_sig_fault where appropriate
signal/unicore32: Generate siginfo in ucs32_notify_die
signal/unicore32: Use send_sig_fault where appropriate
signal/arc: Use force_sig_fault where appropriate
signal/arc: Push siginfo generation into unhandled_exception
signal/ia64: Use force_sig_fault where appropriate
signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
signal/ia64: Use the generic force_sigsegv in setup_frame
signal/arm/kvm: Use send_sig_mceerr
signal/arm: Use send_sig_fault where appropriate
signal/arm: Use force_sig_fault where appropriate
...
Pull networking updates from David Miller:
1) Add VF IPSEC offload support in ixgbe, from Shannon Nelson.
2) Add zero-copy AF_XDP support to i40e, from Björn Töpel.
3) All in-tree drivers are converted to {g,s}et_link_ksettings() so we
can get rid of the {g,s}et_settings ethtool callbacks, from Michal
Kubecek.
4) Add software timestamping to veth driver, from Michael Walle.
5) More work to make packet classifiers and actions lockless, from Vlad
Buslov.
6) Support sticky FDB entries in bridge, from Nikolay Aleksandrov.
7) Add ipv6 version of IP_MULTICAST_ALL sockopt, from Andre Naujoks.
8) Support batching of XDP buffers in vhost_net, from Jason Wang.
9) Add flow dissector BPF hook, from Petar Penkov.
10) i40e vf --> generic iavf conversion, from Jesse Brandeburg.
11) Add NLA_REJECT netlink attribute policy type, to signal when users
provide attributes in situations which don't make sense. From
Johannes Berg.
12) Switch TCP and fair-queue scheduler over to earliest departure time
model. From Eric Dumazet.
13) Improve guest receive performance by doing rx busy polling in tx
path of vhost networking driver, from Tonghao Zhang.
14) Add per-cgroup local storage to bpf
15) Add reference tracking to BPF, from Joe Stringer. The verifier can
now make sure that references taken to objects are properly released
by the program.
16) Support in-place encryption in TLS, from Vakul Garg.
17) Add new taprio packet scheduler, from Vinicius Costa Gomes.
18) Lots of selftests additions, too numerous to mention one by one here
but all of which are very much appreciated.
19) Support offloading of eBPF programs containing BPF to BPF calls in
nfp driver, frm Quentin Monnet.
20) Move dpaa2_ptp driver out of staging, from Yangbo Lu.
21) Lots of u32 classifier cleanups and simplifications, from Al Viro.
22) Add new strict versions of netlink message parsers, and enable them
for some situations. From David Ahern.
23) Evict neighbour entries on carrier down, also from David Ahern.
24) Support BPF sk_msg verdict programs with kTLS, from Daniel Borkmann
and John Fastabend.
25) Add support for filtering route dumps, from David Ahern.
26) New igc Intel driver for 2.5G parts, from Sasha Neftin et al.
27) Allow vxlan enslavement to bridges in mlxsw driver, from Ido
Schimmel.
28) Add queue and stack map types to eBPF, from Mauricio Vasquez B.
29) Add back byte-queue-limit support to r8169, with all the bug fixes
in other areas of the driver it works now! From Florian Westphal and
Heiner Kallweit.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2147 commits)
tcp: add tcp_reset_xmit_timer() helper
qed: Fix static checker warning
Revert "be2net: remove desc field from be_eq_obj"
Revert "net: simplify sock_poll_wait"
net: socionext: Reset tx queue in ndo_stop
net: socionext: Add dummy PHY register read in phy_write()
net: socionext: Stop PHY before resetting netsec
net: stmmac: Set OWN bit for jumbo frames
arm64: dts: stratix10: Support Ethernet Jumbo frame
tls: Add maintainers
net: ethernet: ti: cpsw: unsync mcast entries while switch promisc mode
octeontx2-af: Support for NIXLF's UCAST/PROMISC/ALLMULTI modes
octeontx2-af: Support for setting MAC address
octeontx2-af: Support for changing RSS algorithm
octeontx2-af: NIX Rx flowkey configuration for RSS
octeontx2-af: Install ucast and bcast pkt forwarding rules
octeontx2-af: Add LMAC channel info to NIXLF_ALLOC response
octeontx2-af: NPC MCAM and LDATA extract minimal configuration
octeontx2-af: Enable packet length and csum validation
octeontx2-af: Support for VTAG strip and capture
...
The way our hardware is designed doesn't seem to let us use the
MI_RECORD_PERF_COUNT command without setting up a circular buffer.
In the case where the user didn't request OA reports to be available
through the i915 perf stream, we can set the OA buffer to the minimum
size to avoid consuming memory which won't be used by the driver.
v2: Simplify oa buffer size exponent selection (Chris)
Reuse vma size field (Lionel)
v3: Restrict size opening parameter to values supported by HW (Chris)
v4: Drop out of date comment (Matt)
Add debug message when buffer size is rejected (Matt)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20181023100707.31738-5-lionel.g.landwerlin@intel.com
Pull GPIO updates from Linus Walleij:
"This is the bulk of GPIO changes for the v4.20 series:
Core changes:
- A patch series from Hans Verkuil to make it possible to
enable/disable IRQs on a GPIO line at runtime and drive GPIO lines
as output without having to put/get them from scratch.
The irqchip callbacks have been improved so that they can use only
the fastpatch callbacks to enable/disable irqs like any normal
irqchip, especially the gpiod_lock_as_irq() has been improved to be
callable in fastpath context.
A bunch of rework had to be done to achieve this but it is a big
win since I never liked to restrict this to slowpath. The only call
requireing slowpath was try_module_get() and this is kept at the
.request_resources() slowpath callback. In the GPIO CEC driver this
is a big win sine a single line is used for both outgoing and
incoming traffic, and this needs to use IRQs for incoming traffic
while actively driving the line for outgoing traffic.
- Janusz Krzysztofik improved the GPIO array API to pass a "cookie"
(struct gpio_array) and a bitmap for setting or getting multiple
GPIO lines at once.
This improvement orginated in a specific need to speed up an OMAP1
driver and has led to a much better API and real performance gains
when the state of the array can be used to bypass a lot of checks
and code when we want things to go really fast.
The previous code would minimize the number of calls down to the
driver callbacks assuming the CPU speed was orders of magnitude
faster than the I/O latency, but this assumption was wrong on
several platforms: what we needed to do was to profile and improve
the speed on the hot path of the array functions and this change is
now completed.
- Clean out the painful and hard to grasp BNF experiments from the
device tree bindings. Future approaches are looking into using JSON
schema for this purpose. (Rob Herring is floating a patch series.)
New drivers:
- The RCAR driver now supports r8a774a1 (RZ/G2M).
- Synopsys GPIO via CREGs driver.
Major improvements:
- Modernization of the EP93xx driver to use irqdomain and other
contemporary concepts.
- The ingenic driver has been merged into the Ingenic pin control
driver and removed from the GPIO subsystem.
- Debounce support in the ftgpio010 driver"
* tag 'gpio-v4.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (116 commits)
gpio: Clarify kerneldoc on gpiochip_set_chained_irqchip()
gpio: Remove unused 'irqchip' argument to gpiochip_set_cascaded_irqchip()
gpio: Drop parent irq assignment during cascade setup
mmc: pwrseq_simple: Fix incorrect handling of GPIO bitmap
gpio: fix SNPS_CREG kconfig dependency warning
gpiolib: Initialize gdev field before is used
gpio: fix kernel-doc after devres.c file rename
gpio: fix doc string for devm_gpiochip_add_data() to not talk about irq_chip
gpio: syscon: Fix possible NULL ptr usage
gpiolib: Show correct direction from the beginning
pinctrl: msm: Use init_valid_mask exported function
gpiolib: Add init_valid_mask exported function
GPIO: add single-register GPIO via CREG driver
dt-bindings: Document the Synopsys GPIO via CREG bindings
gpio: mockup: use device properties instead of platform_data
gpio: Slightly more helpful debugfs
gpio: omap: Remove set but not used variable 'dev'
gpio: omap: drop omap_gpio_list
Accept partial 'gpio-line-names' property.
gpio: omap: get rid of the conditional PM runtime calls
...
ASoC: Updates for v5.0/v4.20
As ever there's a lot of small and driver specific changes going on
here, but we do also have some relatively large changes in the core
thanks to the hard work of Charles and Morimoto-san:
- More component transitions from Morimoto-san, I think we're about
finished with this. Thanks for all the hard work!
- Morimoto-san also added a bunch of for_each_foo macros
- A bunch of cleanups and fixes for DAPM from Charles.
- MCLK support for several different devices, including CS42L51, STM32
SAI, and MAX98373.
- Support for Allwinner A64 CODEC analog, Intel boards with DA7219 and
MAX98927, Meson AXG PDM inputs, Nuvoton NAU8822, Renesas R8A7744 and
TI PCM3060.
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-10-21
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Implement two new kind of BPF maps, that is, queue and stack
map along with new peek, push and pop operations, from Mauricio.
2) Add support for MSG_PEEK flag when redirecting into an ingress
psock sk_msg queue, and add a new helper bpf_msg_push_data() for
insert data into the message, from John.
3) Allow for BPF programs of type BPF_PROG_TYPE_CGROUP_SKB to use
direct packet access for __skb_buff, from Song.
4) Use more lightweight barriers for walking perf ring buffer for
libbpf and perf tool as well. Also, various fixes and improvements
from verifier side, from Daniel.
5) Add per-symbol visibility for DSO in libbpf and hide by default
global symbols such as netlink related functions, from Andrey.
6) Two improvements to nfp's BPF offload to check vNIC capabilities
in case prog is shared with multiple vNICs and to protect against
mis-initializing atomic counters, from Jakub.
7) Fix for bpftool to use 4 context mode for the nfp disassembler,
also from Jakub.
8) Fix a return value comparison in test_libbpf.sh and add several
bpftool improvements in bash completion, documentation of bpf fs
restrictions and batch mode summary print, from Quentin.
9) Fix a file resource leak in BPF selftest's load_kallsyms()
helper, from Peng.
10) Fix an unused variable warning in map_lookup_and_delete_elem(),
from Alexei.
11) Fix bpf_skb_adjust_room() signature in BPF UAPI helper doc,
from Nicolas.
12) Add missing executables to .gitignore in BPF selftests, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows user to push data into a msg using sk_msg program types.
The format is as follows,
bpf_msg_push_data(msg, offset, len, flags)
this will insert 'len' bytes at offset 'offset'. For example to
prepend 10 bytes at the front of the message the user can,
bpf_msg_push_data(msg, 0, 10, 0);
This will invalidate data bounds so BPF user will have to then recheck
data bounds after calling this. After this the msg size will have been
updated and the user is free to write into the added bytes. We allow
any offset/len as long as it is within the (data, data_end) range.
However, a copy will be required if the ring is full and its possible
for the helper to fail with ENOMEM or EINVAL errors which need to be
handled by the BPF program.
This can be used similar to XDP metadata to pass data between sk_msg
layer and lower layers.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next tree:
1) Use lockdep_is_held() in ipset_dereference_protected(), from Lance Roy.
2) Remove unused variable in cttimeout, from YueHaibing.
3) Add ttl option for nft_osf, from Fernando Fernandez Mancera.
4) Use xfrm family to deal with IPv6-in-IPv4 packets from nft_xfrm,
from Florian Westphal.
5) Simplify xt_osf_match_packet().
6) Missing ct helper alias definition in snmp_trap helper, from Taehee Yoo.
7) Remove unnecessary parameter in nf_flow_table_cleanup(), from Taehee Yoo.
8) Remove unused variable definitions in nft_{dup,fwd}, from Weongyo Jeong.
9) Remove empty net/netfilter/nfnetlink_log.h file, from Taehee Yoo.
10) Revert xt_quota updates remain option due to problems in the listing
path for 32-bit arches, from Maze.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous patch implemented a bpf queue/stack maps that
provided the peek/pop/push functions. There is not a direct
relationship between those functions and the current maps
syscalls, hence a new MAP_LOOKUP_AND_DELETE_ELEM syscall is added,
this is mapped to the pop operation in the queue/stack maps
and it is still to implement in other kind of maps.
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Queue/stack maps implement a FIFO/LIFO data storage for ebpf programs.
These maps support peek, pop and push operations that are exposed to eBPF
programs through the new bpf_map[peek/pop/push] helpers. Those operations
are exposed to userspace applications through the already existing
syscalls in the following way:
BPF_MAP_LOOKUP_ELEM -> peek
BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
BPF_MAP_UPDATE_ELEM -> push
Queue/stack maps are implemented using a buffer, tail and head indexes,
hence BPF_F_NO_PREALLOC is not supported.
As opposite to other maps, queue and stack do not use RCU for protecting
maps values, the bpf_map[peek/pop] have a ARG_PTR_TO_UNINIT_MAP_VALUE
argument that is a pointer to a memory zone where to save the value of a
map. Basically the same as ARG_PTR_TO_UNINIT_MEM, but the size has not
be passed as an extra argument.
Our main motivation for implementing queue/stack maps was to keep track
of a pool of elements, like network ports in a SNAT, however we forsee
other use cases, like for exampling saving last N kernel events in a map
and then analysing from userspace.
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
net/sched/cls_api.c has overlapping changes to a call to
nlmsg_parse(), one (from 'net') added rtm_tca_policy instead of NULL
to the 5th argument, and another (from 'net-next') added cb->extack
instead of NULL to the 6th argument.
net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to
code which moved (to mr_table_dump)) in 'net-next'. Thanks to David
Ahern for the heads up.
Signed-off-by: David S. Miller <davem@davemloft.net>
KVM/arm updates for 4.20
- Improved guest IPA space support (32 to 52 bits)
- RAS event delivery for 32bit
- PMU fixes
- Guest entry hardening
- Various cleanups
This reverts commit e9837e55b0.
When talking to Maze and Chenbo, we agreed to keep this back by now
due to problems in the ruleset listing path with 32-bit arches.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch is for VK_KHR_timeline_semaphore extension, semaphore is called syncobj in kernel side:
This extension introduces a new type of syncobj that has an integer payload
identifying a point in a timeline. Such timeline syncobjs support the
following operations:
* CPU query - A host operation that allows querying the payload of the
timeline syncobj.
* CPU wait - A host operation that allows a blocking wait for a
timeline syncobj to reach a specified value.
* Device wait - A device operation that allows waiting for a
timeline syncobj to reach a specified value.
* Device signal - A device operation that allows advancing the
timeline syncobj to a specified value.
v1:
Since it's a timeline, that means the front time point(PT) always is signaled before the late PT.
a. signal PT design:
Signal PT fence N depends on PT[N-1] fence and signal opertion fence, when PT[N] fence is signaled,
the timeline will increase to value of PT[N].
b. wait PT design:
Wait PT fence is signaled by reaching timeline point value, when timeline is increasing, will compare
wait PTs value with new timeline value, if PT value is lower than timeline value, then wait PT will be
signaled, otherwise keep in list. syncobj wait operation can wait on any point of timeline,
so need a RB tree to order them. And wait PT could ahead of signal PT, we need a sumission fence to
perform that.
v2:
1. remove unused DRM_SYNCOBJ_CREATE_TYPE_NORMAL. (Christian)
2. move unexposed denitions to .c file. (Daniel Vetter)
3. split up the change to drm_syncobj_find_fence() in a separate patch. (Christian)
4. split up the change to drm_syncobj_replace_fence() in a separate patch.
5. drop the submission_fence implementation and instead use wait_event() for that. (Christian)
6. WARN_ON(point != 0) for NORMAL type syncobj case. (Daniel Vetter)
v3:
1. replace normal syncobj with timeline implemenation. (Vetter and Christian)
a. normal syncobj signal op will create a signal PT to tail of signal pt list.
b. normal syncobj wait op will create a wait pt with last signal point, and this wait PT is only signaled by related signal point PT.
2. many bug fix and clean up
3. stub fence moving is moved to other patch.
v4:
1. fix RB tree loop with while(node=rb_first(...)). (Christian)
2. fix syncobj lifecycle. (Christian)
3. only enable_signaling when there is wait_pt. (Christian)
4. fix timeline path issues.
5. write a timeline test in libdrm
v5: (Christian)
1. semaphore is called syncobj in kernel side.
2. don't need 'timeline' characters in some function name.
3. keep syncobj cb.
v6: (Christian)
1. merge syncobj_timeline to syncobj structure.
2. simplify some check sentences.
3. some misc change.
4. fix CTS failed issue.
v7: (Christian)
1. error handling when creating signal pt.
2. remove timeline naming in func.
3. export flags in find_fence.
4. allow reset timeline.
v8:
1. use wait_event_interruptible without timeout
2. rename _TYPE_INDIVIDUAL to _TYPE_BINARY
v9:
1. rename signal_pt->base to signal_pt->fence_array to avoid misleading
2. improve kerneldoc
individual syncobj is tested by ./deqp-vk -n dEQP-VK*semaphore*
timeline syncobj is tested by ./amdgpu_test -s 9
Signed-off-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Cc: Christian Konig <christian.koenig@amd.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Daniel Rakos <Daniel.Rakos@amd.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Christian König <christian.koenig@amd.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/257258/
Needed by userspace programs that call fstatfs().
It'd be natural to publish XFS_SB_MAGIC in uapi, but while these two
have identical values, they have different semantic meaning: one is
an enum cookie meant for statfs, the other a signature of the
on-disk format.
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The macro PAGE_SIZE isn't valid outside of the kernel, so it should not
appear in UAPI headers.
Furthermore, the actual machine page size could theoretically change from
an application's point of view if it's running in a container that gets
migrated to another machine (say 4K/ppc64 to 64K/ppc64).
Fixes: f2ba5a5bae ("libnvdimm, namespace: make min namespace size 4K")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The following code in the linux/ndctl header file:
static inline const char *nvdimm_bus_cmd_name(unsigned cmd)
{
static const char * const names[] = {
[ND_CMD_ARS_CAP] = "ars_cap",
[ND_CMD_ARS_START] = "ars_start",
[ND_CMD_ARS_STATUS] = "ars_status",
[ND_CMD_CLEAR_ERROR] = "clear_error",
[ND_CMD_CALL] = "cmd_call",
};
if (cmd < ARRAY_SIZE(names) && names[cmd])
return names[cmd];
return "unknown";
}
is broken in a number of ways:
(1) ARRAY_SIZE() is not generally defined.
(2) g++ does not support "non-trivial" array initialisers fully yet.
(3) Every file that calls this function will acquire a copy of names[].
The same goes for nvdimm_cmd_name().
Fix all three by converting to a switch statement where each case returns a
string. That way if cmd is a constant, the compiler can trivially reduce it
and, if not, the compiler can use a shared lookup table if it thinks that is
more efficient.
A better way would be to remove these functions and their arrays from the
header entirely.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This is a per-VM capability which can be enabled by userspace so that
the faulting linear address will be included with the information
about a pending #PF in L2, and the "new DR6 bits" will be included
with the information about a pending #DB in L2. With this capability
enabled, the L1 hypervisor can now intercept #PF before CR2 is
modified. Under VMX, the L1 hypervisor can now intercept #DB before
DR6 and DR7 are modified.
When userspace has enabled KVM_CAP_EXCEPTION_PAYLOAD, it should
generally provide an appropriate payload when injecting a #PF or #DB
exception via KVM_SET_VCPU_EVENTS. However, to support restoring old
checkpoints, this payload is not required.
Note that bit 16 of the "new DR6 bits" is set to indicate that a debug
exception (#DB) or a breakpoint exception (#BP) occurred inside an RTM
region while advanced debugging of RTM transactional regions was
enabled. This is the reverse of DR6.RTM, which is cleared in this
scenario.
This capability also enables exception.pending in struct
kvm_vcpu_events, which allows userspace to distinguish between pending
and injected exceptions.
Reported-by: Jim Mattson <jmattson@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Requester scatter to CQE is restricted to QPs configured to signal
all WRs.
This patch adds ability to enable scatter to cqe (force enable)
in the requester without sig_all, for users who do not want all WRs
signaled but rather just the ones whose data found in the CQE.
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Enlightened VMCS is opt-in. The current version does not contain all
fields supported by nested VMX so we must not advertise the
corresponding VMX features if enlightened VMCS is enabled.
Userspace is given the enlightened VMCS version supported by KVM as
part of enabling KVM_CAP_HYPERV_ENLIGHTENED_VMCS. The version is to
be advertised to the nested hypervisor, currently done via a cpuid
leaf for Hyper-V.
Suggested-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Coalesced pio is based on coalesced mmio and can be used for some port
like rtc port, pci-host config port and so on.
Specially in case of rtc as coalesced pio, some versions of windows guest
access rtc frequently because of rtc as system tick. guest access rtc like
this: write register index to 0x70, then write or read data from 0x71.
writing 0x70 port is just as index and do nothing else. So we can use
coalesced pio to handle this scene to reduce VM-EXIT time.
When starting and closing a virtual machine, it will access pci-host config
port frequently. So setting these port as coalesced pio can reduce startup
and shutdown time.
without my patch, get the vm-exit time of accessing rtc 0x70 and piix 0xcf8
using perf tools: (guest OS : windows 7 64bit)
IO Port Access Samples Samples% Time% Min Time Max Time Avg time
0x70:POUT 86 30.99% 74.59% 9us 29us 10.75us (+- 3.41%)
0xcf8:POUT 1119 2.60% 2.12% 2.79us 56.83us 3.41us (+- 2.23%)
with my patch
IO Port Access Samples Samples% Time% Min Time Max Time Avg time
0x70:POUT 106 32.02% 29.47% 0us 10us 1.57us (+- 7.38%)
0xcf8:POUT 1065 1.67% 0.28% 0.41us 65.44us 0.66us (+- 10.55%)
Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Using hypercall for sending IPIs is faster because this allows to specify
any number of vCPUs (even > 64 with sparse CPU set), the whole procedure
will take only one VMEXIT.
Current Hyper-V TLFS (v5.0b) claims that HvCallSendSyntheticClusterIpi
hypercall can't be 'fast' (passing parameters through registers) but
apparently this is not true, Windows always uses it as 'fast' so we need
to support that.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Provide an option to rename IB device name through RDMA netlink and
limit it to users with ADMIN capability only.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
According to rfc7496 section 4.3 or 4.4:
sprstat_policy: This parameter indicates for which PR-SCTP policy
the user wants the information. It is an error to use
SCTP_PR_SCTP_NONE in sprstat_policy. If SCTP_PR_SCTP_ALL is used,
the counters provided are aggregated over all supported policies.
We change to dump pr_assoc and pr_stream all status by SCTP_PR_SCTP_ALL
instead, and return error for SCTP_PR_SCTP_NONE, as it also said "It is
an error to use SCTP_PR_SCTP_NONE in sprstat_policy. "
Fixes: 826d253d57 ("sctp: add SCTP_PR_ASSOC_STATUS on sctp sockopt")
Fixes: d229d48d18 ("sctp: add SCTP_PR_STREAM_STATUS sockopt for prsctp")
Reported-by: Ying Xu <yinxu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ttl option support to the nftables "osf" expression.
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A user can provide a hint which will be attached to the packet and written
to the CQE on receive. This can be used as a way to offload operations
into the HW, for example parsing a packet which is a tunneled packet, and
if so, pass 0x1 as the hint. The software can use that hint to decapsulate
the packet and parse only the inner headers thus saving CPU cycles.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The new command (NCSI_CMD_SEND_CMD) is added to allow user space application
to send NC-SI command to the network card.
Also, add a new attribute (NCSI_ATTR_DATA) for transferring request and response.
The work flow is as below.
Request:
User space application
-> Netlink interface (msg)
-> new Netlink handler - ncsi_send_cmd_nl()
-> ncsi_xmit_cmd()
Response:
Response received - ncsi_rcv_rsp()
-> internal response handler - ncsi_rsp_handler_xxx()
-> ncsi_rsp_handler_netlink()
-> ncsi_send_netlink_rsp ()
-> Netlink interface (msg)
-> user space application
Command timeout - ncsi_request_timeout()
-> ncsi_send_netlink_timeout ()
-> Netlink interface (msg with zero data length)
-> user space application
Error:
Error detected
-> ncsi_send_netlink_err ()
-> Netlink interface (err msg)
-> user space application
Signed-off-by: Justin Lee <justin.lee1@dell.com>
Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for the DEC FDDIcontroller 700 (DEFZA), Digital Equipment
Corporation's first-generation FDDI network interface adapter, made for
TURBOchannel and based on a discrete version of what eventually became
Motorola's widely used CAMEL chipset.
The CAMEL chipset is present for example in the DEC FDDIcontroller
TURBOchannel, EISA and PCI adapters (DEFTA/DEFEA/DEFPA) that we support
with the `defxx' driver, however the host bus interface logic and the
firmware API are different in the DEFZA and hence a separate driver is
required.
There isn't much to say about the driver except that it works, but there
is one peculiarity to mention. The adapter implements two Tx/Rx queue
pairs.
Of these one pair is the usual network Tx/Rx queue pair, in this case
used by the adapter to exchange frames with the ring, via the RMC (Ring
Memory Controller) chip. The Tx queue is handled directly by the RMC
chip and resides in onboard packet memory. The Rx queue is maintained
via DMA in host memory by adapter's firmware copying received data
stored by the RMC in onboard packet memory.
The other pair is used to communicate SMT frames with adapter's
firmware. Any SMT frame received from the RMC via the Rx queue must be
queued back by the driver to the SMT Rx queue for the firmware to
process. Similarly the firmware uses the SMT Tx queue to supply the
driver with SMT frames that must be queued back to the Tx queue for the
RMC to send to the ring.
This solution was chosen because the designers ran out of PCB space and
could not squeeze in more logic onto the board that would be required to
handle this SMT frame traffic without the need to involve the driver, as
with the later DEFTA/DEFEA/DEFPA adapters.
Finally the driver does some Frame Control byte decoding, so to avoid
magic numbers some macros are added to <linux/if_fddi.h>.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We get a warning from 'make headers_check' about a newly introduced usage
of integer types in the scsi/scsi_bsg_ufs.h uapi header:
usr/include/scsi/scsi_bsg_ufs.h:18: found __[us]{8,16,32,64} type without #include <linux/types.h>
Aside from the missing linux/types.h inclusion, I also noticed that it
uses the wrong types: 'u32' is not available at all in user space, and
'uint32_t' depends on the inclusion of a standard header that we should
not include from kernel headers.
Change the all to __u32 and similar types here.
I also note the usage of '__be32' and '__be16' that seems unfortunate for
a user space API. I wonder if it would be better to define the interface
in terms of a CPU-endian structure and convert it in kernel space.
Fixes: e77044c5a8 ("scsi: ufs-bsg: Add support for uic commands in ufs_bsg_request()")
Fixes: df032bf27a ("scsi: ufs: Add a bsg endpoint that supports UPIUs")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Avri Altman <avri.altman@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
FUSE file reads are cached in the page cache, but symlink reads are
not. This patch enables FUSE READLINK operations to be cached which
can improve performance of some FUSE workloads.
In particular, I'm working on a FUSE filesystem for access to source
code and discovered that about a 10% improvement to build times is
achieved with this patch (there are a lot of symlinks in the source
tree).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Conflicts were easy to resolve using immediate context mostly,
except the cls_u32.c one where I simply too the entire HEAD
chunk.
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg says:
====================
Highlights:
* merge net-next, so I can finish the hwsim workqueue removal
* fix TXQ NULL pointer issue that was reported multiple times
* minstrel cleanups from Felix
* simplify lib80211 code by not using skcipher, note that this
will conflict with the crypto tree (and this new code here
should be used)
* use new netlink policy validation in nl80211
* fix up SAE (part of WPA3) in client-mode
* FTM responder support in the stack
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds an option to have per-port vlan stats instead of the
default global stats. The option can be set only when there are no port
vlans in the bridge since we need to allocate the stats if it is set
when vlans are being added to ports (and respectively free them
when being deleted). Also bump RTNL_MAX_TYPE as the bridge is the
largest user of options. The current stats design allows us to add
these without any changes to the fast-path, it all comes down to
the per-vlan stats pointer which, if this option is enabled, will
be allocated for each port vlan instead of using the global bridge-wide
one.
CC: bridge@lists.linux-foundation.org
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for drivers to report the total number of MPDUs received
and the number of MPDUs received with an FCS error from a specific
peer. These counters will be incremented only when the TA of the
frame matches the MAC address of the peer irrespective of FCS
error.
It should be noted that the TA field in the frame might be corrupted
when there is an FCS error and TA matching logic would fail in such
cases. Hence, FCS error counter might not be fully accurate, but it can
provide help in detecting bad RX links in significant number of cases.
This FCS error counter without full accuracy can be used, e.g., to
trigger a kick-out of a connected client with a bad link in AP mode to
force such a client to roam to another AP.
Signed-off-by: Ankita Bajaj <bankita@codeaurora.org>
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This allows to set EDID monitor information for the vgpu display, for a
more flexible display configuration, using a special vfio region. Check
the comment describing struct vfio_region_gfx_edid for more details.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
David Howells says:
====================
rxrpc: Fix packet reception code
Here are a set of patches that prepares for and fix problems in rxrpc's
package reception code. There serious problems are:
(A) There's a window between binding the socket and setting the data_ready
hook in which packets can find their way into the UDP socket's receive
queues.
(B) The skb_recv_udp() will return an error (and clear the error state) if
there was an error on the Tx side. rxrpc doesn't handle this.
(C) The rxrpc data_ready handler doesn't fully drain the UDP receive
queue.
(D) The rxrpc data_ready handler assumes it is called in a non-reentrant
state.
The second patch fixes (A) - (C); the third patch renders (B) and (C)
non-issues by using the recap_rcv hook instead of data_ready - and the
final patch fixes (D). That last is the most complex.
The preparatory patches are:
(1) Fix some places that are doing things in the wrong net namespace.
(2) Stop taking the rcu read lock as it's held by the IP input routine in
the call chain.
(3) Only end the Tx phase if *we* rotated the final packet out of the Tx
buffer.
(4) Don't assume that the call state won't change after dropping the
call_state lock.
(5) Only take receive window and MTU suze parameters from an ACK packet if
it's the latest ACK packet.
(6) Record connection-level abort information correctly.
(7) Fix a trace line.
And then there are three main patches - note that these are mixed in with
the preparatory patches somewhat:
(1) Fix the setup window (A), skb_recv_udp() error check (B) and packet
drainage (C).
(2) Switch to using the encap_rcv instead of data_ready to cut out the
effects of the UDP read queues and get the packets delivered directly.
(3) Add more locking into the various packet input paths to defend against
re-entrance (D).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
For now, just provide an API to allocate and remove ufs-bsg node. We
will use this framework to manage ufs devices by sending UPIU
transactions.
For the time being, implements an empty bsg_request() - will add some
more functionality in coming patches.
Nonetheless, we reveal here the protocol we are planning to use: UFS
Transport Protocol Transactions. UFS transactions consist of packets
called UFS Protocol Information Units (UPIU).
There are UPIU’s defined for UFS SCSI commands, responses, data in and
data out, task management, utility functions, vendor functions,
transaction synchronization and control, and more.
By using UPIUs, we get access to the most fine-grained internals of this
protocol, and able to communicate with the device in ways, that are
sometimes beyond the capacity of the ufs driver.
Moreover and as a result, our core structure - ufs_bsg_node has a pretty
lean structure: using upiu transactions that contains the outmost
detailed info, so we don't really need complex constructs to support it.
Signed-off-by: Avri Altman <avri.altman@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2018-10-08
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) sk_lookup_[tcp|udp] and sk_release helpers from Joe Stringer which allow
BPF programs to perform lookups for sockets in a network namespace. This would
allow programs to determine early on in processing whether the stack is
expecting to receive the packet, and perform some action (eg drop,
forward somewhere) based on this information.
2) per-cpu cgroup local storage from Roman Gushchin.
Per-cpu cgroup local storage is very similar to simple cgroup storage
except all the data is per-cpu. The main goal of per-cpu variant is to
implement super fast counters (e.g. packet counters), which don't require
neither lookups, neither atomic operations in a fast path.
The example of these hybrid counters is in selftests/bpf/netcnt_prog.c
3) allow HW offload of programs with BPF-to-BPF function calls from Quentin Monnet
4) support more than 64-byte key/value in HW offloaded BPF maps from Jakub Kicinski
5) rename of libbpf interfaces from Andrey Ignatov.
libbpf is maturing as a library and should follow good practices in
library design and implementation to play well with other libraries.
This patch set brings consistent naming convention to global symbols.
6) relicense libbpf as LGPL-2.1 OR BSD-2-Clause from Alexei Starovoitov
to let Apache2 projects use libbpf
7) various AF_XDP fixes from Björn and Magnus
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a KVM_PPC_NO_HASH flag to the flags field of the
kvm_ppc_smmu_info struct, and arranges for it to be set when
running as a nested hypervisor, as an unambiguous indication
to userspace that HPT guests are not supported. Reporting the
KVM_CAP_PPC_MMU_HASH_V3 capability as false could be taken as
indicating only that the new HPT features in ISA V3.0 are not
supported, leaving it ambiguous whether pre-V3.0 HPT features
are supported.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With this, userspace can enable a KVM-HV guest to run nested guests
under it.
The administrator can control whether any nested guests can be run;
setting the "nested" module parameter to false prevents any guests
becoming nested hypervisors (that is, any attempt to enable the nested
capability on a guest will fail). Guests which are already nested
hypervisors will continue to be so.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next tree:
1) Support for matching on ipsec policy already set in the route, from
Florian Westphal.
2) Split set destruction into deactivate and destroy phase to make it
fit better into the transaction infrastructure, also from Florian.
This includes a patch to warn on imbalance when setting the new
activate and deactivate interfaces.
3) Release transaction list from the workqueue to remove expensive
synchronize_rcu() from configuration plane path. This speeds up
configuration plane quite a bit. From Florian Westphal.
4) Add new xfrm/ipsec extension, this new extension allows you to match
for ipsec tunnel keys such as source and destination address, spi and
reqid. From Máté Eckl and Florian Westphal.
5) Add secmark support, this includes connsecmark too, patches
from Christian Gottsche.
6) Allow to specify remaining bytes in xt_quota, from Chenbo Feng.
One follow up patch to calm a clang warning for this one, from
Nathan Chancellor.
7) Flush conntrack entries based on layer 3 family, from Kristian Evensen.
8) New revision for cgroups2 to shrink the path field.
9) Get rid of obsolete need_conntrack(), as a result from recent
demodularization works.
10) Use WARN_ON instead of BUG_ON, from Florian Westphal.
11) Unused exported symbol in nf_nat_ipv4_fn(), from Florian.
12) Remove superfluous check for timeout netlink parser and dump
functions in layer 4 conntrack helpers.
13) Unnecessary redundant rcu read side locks in NAT redirect,
from Taehee Yoo.
14) Pass nf_hook_state structure to error handlers, patch from
Florian Westphal.
15) Remove ->new() interface from layer 4 protocol trackers. Place
them in the ->packet() interface. From Florian.
16) Place conntrack ->error() handling in the ->packet() interface.
Patches from Florian Westphal.
17) Remove unused parameter in the pernet initialization path,
also from Florian.
18) Remove additional parameter to specify layer 3 protocol when
looking up for protocol tracker. From Florian.
19) Shrink array of layer 4 protocol trackers, from Florian.
20) Check for linear skb only once from the ALG NAT mangling
codebase, from Taehee Yoo.
21) Use rhashtable_walk_enter() instead of deprecated
rhashtable_walk_init(), also from Taehee.
22) No need to flush all conntracks when only one single address
is gone, from Tan Hu.
23) Remove redundant check for NAT flags in flowtable code, from
Taehee Yoo.
24) Use rhashtable_lookup() instead of rhashtable_lookup_fast()
from netfilter codebase, since rcu read lock side is already
assumed in this path.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new socket option, NETLINK_DUMP_STRICT_CHK, that userspace
can use via setsockopt to request strict checking of headers and
attributes on dump requests.
To get dump features such as kernel side filtering based on data in
the header or attributes appended to the dump request, userspace
must call setsockopt() for NETLINK_DUMP_STRICT_CHK and a non-zero
value. Since the netlink sock and its flags are private to the
af_netlink code, the strict checking flag is passed to dump handlers
via a flag in the netlink_callback struct.
For old userspace on new kernel there is no impact as all of the data
checks in later patches are wrapped in a check on the new strict flag.
For new userspace on old kernel, the setsockopt will fail and even if
new userspace sets data in the headers and appended attributes the
kernel will silently ignore it. Moving forward when the setsockopt
succeeds, the new userspace on old kernel means the dump request can
pass an attribute the kernel does not understand. The dump will then
fail as the older kernel does not understand it.
New userspace on new kernel setting the socket option gets the benefit
of the improved data dump.
Kernel side the NETLINK_DUMP_STRICT_CHK uapi is converted to a generic
NETLINK_F_STRICT_CHK flag which can potentially be leveraged for tighter
checking on the NEW, DEL, and SET commands.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the UDP encap_rcv hook to cut the bit out of the rxrpc packet reception
in which a packet is placed onto the UDP receive queue and then immediately
removed again by rxrpc. Going via the queue in this manner seems like it
should be unnecessary.
This does, however, require the invention of a value to place in encap_type
as that's one of the conditions to switch packets out to the encap_rcv
hook. Possibly the value doesn't actually matter for anything other than
sockopts on the UDP socket, which aren't accessible outside of rxrpc
anyway.
This seems to cut a bit of time out of the time elapsed between each
sk_buff being timestamped and turning up in rxrpc (the final number in the
following trace excerpts). I measured this by making the rxrpc_rx_packet
trace point print the time elapsed between the skb being timestamped and
the current time (in ns), e.g.:
... 424.278721: rxrpc_rx_packet: ... ACK 25026
So doing a 512MiB DIO read from my test server, with an unmodified kernel:
N min max sum mean stddev
27605 2626 7581 7.83992e+07 2840.04 181.029
and with the patch applied:
N min max sum mean stddev
27547 1895 12165 6.77461e+07 2459.29 255.02
Signed-off-by: David Howells <dhowells@redhat.com>