Even though libbpf's versioning script for the linker (libbpf.map)
is pointing to 0.0.2, the BPF_EXTRAVERSION in the Makefile has
not been updated along with it and is therefore still on 0.0.1.
While fixing up, I also noticed that the generated shared object
versioning information is missing, typical convention is to have
a linker name (libbpf.so), soname (libbpf.so.0) and real name
(libbpf.so.0.0.2) for library management. This is based upon the
LIBBPF_VERSION as well.
The build will then produce the following bpf libraries:
# ll libbpf*
libbpf.a
libbpf.so -> libbpf.so.0.0.2
libbpf.so.0 -> libbpf.so.0.0.2
libbpf.so.0.0.2
# readelf -d libbpf.so.0.0.2 | grep SONAME
0x000000000000000e (SONAME) Library soname: [libbpf.so.0]
And install them accordingly:
# rm -rf /tmp/bld; mkdir /tmp/bld; make -j$(nproc) O=/tmp/bld install
Auto-detecting system features:
... libelf: [ on ]
... bpf: [ on ]
CC /tmp/bld/libbpf.o
CC /tmp/bld/bpf.o
CC /tmp/bld/nlattr.o
CC /tmp/bld/btf.o
CC /tmp/bld/libbpf_errno.o
CC /tmp/bld/str_error.o
CC /tmp/bld/netlink.o
CC /tmp/bld/bpf_prog_linfo.o
CC /tmp/bld/libbpf_probes.o
CC /tmp/bld/xsk.o
LD /tmp/bld/libbpf-in.o
LINK /tmp/bld/libbpf.a
LINK /tmp/bld/libbpf.so.0.0.2
LINK /tmp/bld/test_libbpf
INSTALL /tmp/bld/libbpf.a
INSTALL /tmp/bld/libbpf.so.0.0.2
# ll /usr/local/lib64/libbpf.*
/usr/local/lib64/libbpf.a
/usr/local/lib64/libbpf.so -> libbpf.so.0.0.2
/usr/local/lib64/libbpf.so.0 -> libbpf.so.0.0.2
/usr/local/lib64/libbpf.so.0.0.2
Fixes: 1bf4b05810 ("tools: bpftool: add probes for eBPF program types")
Fixes: 1b76c13e4b ("bpf tools: Introduce 'bpf' library and add bpf feature check")
Reported-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
These three variables are set in one branch and used in another with
the same condition. But on some architectures they still generate
compiler warnings of the kind:
warning: 'inner_trans' may be used uninitialized in this function [-Wmaybe-uninitialized]
Silence these false positives. Use the straightforward approach to
always initialize them, if a bit superfluous.
Fixes: 868d523535 ("bpf: add bpf_skb_adjust_room encap flags")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jiri Pirko says:
====================
devlink: small spring cleanup
Mostly cosmetics and janitor work.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Some drivers are becoming more dependent on NET_DEVLINK being selected
in configuration. With upcoming compat functions, the behavior would be
wrong in case devlink was not compiled in. So make the drivers select
NET_DEVLINK and rely on the functions being there, not just stubs.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add spinlock to protect port type and type_dev pointer consistency.
Without that, userspace may see inconsistent type and type_dev
combinations.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
v1->v2:
- rebased
Signed-off-by: David S. Miller <davem@davemloft.net>
Port needs to be registered first before the type is set. Warn and
bail-out in case it is not.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the type set of devlink port after it is registered.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to other driver, move the port type set after netdev registration
is done. Along with that, clear the type before unregistration.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the port attributes are static and cannot change during the port
lifetime, WARN_ON if some driver calls it after registration. Also, no
need to call notifications as it is noop anyway due to check of
devlink_port->registered there.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since attrs are static during the existence of devlink port, set the
before registration of the port.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since attrs are static during the existence of devlink port, set the
before registration of the port.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__devlink_port_type_set() returns void, it makes no sense to pass it on,
so don't do that.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netdevice is guaranteed to not disappear so we can rely that
devlink_port and devlink won't disappear as well. No need to take
devlink_mutex so don't take it here.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Call devlink_port_type_eth_set() before devlink_port_register(). Bnxt
instances won't change type during lifetime. This avoids one extra
userspace devlink notification.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set the attrs properly so delink has enough info to generate physical
port names.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
devlink functions are in use, so include the related header file.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
devlink functions are in use, so include the related header file.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing called to mutex_destroy() for two mutexes used
in devlink code.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull scheduler updates from Thomas Gleixner:
"Third more careful attempt for this set of fixes:
- Prevent a 32bit math overflow in the cpufreq code
- Fix a buffer overflow when scanning the cgroup2 cpu.max property
- A set of fixes for the NOHZ scheduler logic to prevent waking up
CPUs even if the capacity of the busy CPUs is sufficient along with
other tweaks optimizing the behaviour for asymmetric systems
(big/little)"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Skip LLC NOHZ logic for asymmetric systems
sched/fair: Tune down misfit NOHZ kicks
sched/fair: Comment some nohz_balancer_kick() kick conditions
sched/core: Fix buffer overflow in cgroup2 property cpu.max
sched/cpufreq: Fix 32-bit math overflow
Pull perf updates from Thomas Gleixner:
"A larger set of perf updates.
Not all of them are strictly fixes, but that's solely the tip
maintainers fault as they let the timely -rc1 pull request fall
through the cracks for various reasons including travel. So I'm
sending this nevertheless because rebasing and distangling fixes and
updates would be a mess and risky as well. As of tomorrow, a strict
fixes separation is happening again. Sorry for the slip-up.
Kernel:
- Handle RECORD_MMAP vs. RECORD_MMAP2 correctly so different
consumers of the mmap event get what they requested.
Tools:
- A larger set of updates to perf record/report/scripts vs. time
stamp handling
- More Python3 fixups
- A pile of memory leak plumbing
- perf BPF improvements and fixes
- Finalize the perf.data directory storage"
[ Note: the kernel part is strictly a fix, the updates are purely to
tooling - Linus ]
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
perf bpf: Show more BPF program info in print_bpf_prog_info()
perf bpf: Extract logic to create program names from perf_event__synthesize_one_bpf_prog()
perf tools: Save bpf_prog_info and BTF of new BPF programs
perf evlist: Introduce side band thread
perf annotate: Enable annotation of BPF programs
perf build: Check what binutils's 'disassembler()' signature to use
perf bpf: Process PERF_BPF_EVENT_PROG_LOAD for annotation
perf symbols: Introduce DSO_BINARY_TYPE__BPF_PROG_INFO
perf feature detection: Add -lopcodes to feature-libbfd
perf top: Add option --no-bpf-event
perf bpf: Save BTF information as headers to perf.data
perf bpf: Save BTF in a rbtree in perf_env
perf bpf: Save bpf_prog_info information as headers to perf.data
perf bpf: Save bpf_prog_info in a rbtree in perf_env
perf bpf: Make synthesize_bpf_events() receive perf_session pointer instead of perf_tool
perf bpf: Synthesize bpf events with bpf_program__get_prog_info_linear()
bpftool: use bpf_program__get_prog_info_linear() in prog.c:do_dump()
tools lib bpf: Introduce bpf_program__get_prog_info_linear()
perf record: Replace option --bpf-event with --no-bpf-event
perf tests: Fix a memory leak in test__perf_evsel__tp_sched_test()
...
Pull x86 fixes from Thomas Gleixner:
"A set of x86 fixes:
- Prevent potential NULL pointer dereferences in the HPET and HyperV
code
- Exclude the GART aperture from /proc/kcore to prevent kernel
crashes on access
- Use the correct macros for Cyrix I/O on Geode processors
- Remove yet another kernel address printk leak
- Announce microcode reload completion as requested by quite some
people. Microcode loading has become popular recently.
- Some 'Make Clang' happy fixlets
- A few cleanups for recently added code"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/gart: Exclude GART aperture from kcore
x86/hw_breakpoints: Make default case in hw_breakpoint_arch_parse() return an error
x86/mm/pti: Make local symbols static
x86/cpu/cyrix: Remove {get,set}Cx86_old macros used for Cyrix processors
x86/cpu/cyrix: Use correct macros for Cyrix calls on Geode processors
x86/microcode: Announce reload operation's completion
x86/hyperv: Prevent potential NULL pointer dereference
x86/hpet: Prevent potential NULL pointer dereference
x86/lib: Fix indentation issue, remove extra tab
x86/boot: Restrict header scope to make Clang happy
x86/mm: Don't leak kernel addresses
x86/cpufeature: Fix various quality problems in the <asm/cpu_device_hd.h> header
Pull timer fixes from Thomas Gleixner:
"A set of small fixes plus the removal of stale board support code:
- Remove the board support code from the clpx711x clocksource driver.
This change had fallen through the cracks and I'm sending it now
rather than dealing with people who want to improve that stale code
for 3 month.
- Use the proper clocksource mask on RICSV
- Make local scope functions and variables static"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/clps711x: Remove board support
clocksource/drivers/riscv: Fix clocksource mask
clocksource/drivers/mips-gic-timer: Make gic_compare_irqaction static
clocksource/drivers/timer-ti-dm: Make omap_dm_timer_set_load_start() static
clocksource/drivers/tcb_clksrc: Make tc_clksrc_suspend/resume() static
clocksource/drivers/clps711x: Make clps711x_clksrc_init() static
time/jiffies: Make refined_jiffies static
Pull locking fixes from Thomas Gleixner:
"Two small fixes:
- Cure a recently introduces error path hickup which tries to
unregister a not registered lockdep key in te workqueue code
- Prevent unaligned cmpxchg() crashes in the robust list handling
code by sanity checking the user space supplied futex pointer"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Ensure that futex address is aligned in handle_futex_death()
workqueue: Only unregister a registered lockdep key
Pull irq fixes from Thomas Gleixner:
"A set of fixes for the interrupt subsystem:
- Remove secondary GIC support on systems w/o device-tree support
- A set of small fixlets in various irqchip drivers
- static and fall-through annotations
- Kernel doc and typo fixes"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Mark expected switch case fall-through
genirq/devres: Remove excess parameter from kernel doc
irqchip/irq-mvebu-sei: Make mvebu_sei_ap806_caps static
irqchip/mbigen: Don't clear eventid when freeing an MSI
irqchip/stm32: Don't set rising configuration registers at init
irqchip/stm32: Don't clear rising/falling config registers at init
dt-bindings: irqchip: renesas-irqc: Document r8a774c0 support
irqchip/mmp: Make mmp_irq_domain_ops static
irqchip/brcmstb-l2: Make two init functions static
genirq: Fix typo in comment of IRQD_MOVE_PCNTXT
irqchip/gic-v3-its: Fix comparison logic in lpi_range_cmp
irqchip/gic: Drop support for secondary GIC in non-DT systems
irqchip/imx-irqsteer: Fix of_property_read_u32() error handling
Pull core fixes from Thomas Gleixner:
"Two small fixes:
- Move the large objtool_file struct off the stack so objtool works
in setups with a tight stack limit.
- Make a few variables static in the watchdog core code"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
watchdog/core: Make variables static
objtool: Move objtool_file struct off the stack
Pull thermal management fixes from Zhang Rui:
- Fix a wrong __percpu structure declaration in intel_powerclamp driver
(Luc Van Oostenryck)
- Fix truncated name of the idle injection kthreads created by
intel_powerclamp driver (Zhang Rui)
- Fix the missing UUID supports in int3400 thermal driver (Matthew
Garrett)
- Fix a crash when accessing the debugfs of bcm2835 SoC thermal driver
(Phil Elwell)
- A couple of trivial fixes/cleanups in some SoC thermal drivers
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux:
thermal/intel_powerclamp: fix truncated kthread name
thermal: mtk: Allocate enough space for mtk_thermal.
thermal/int340x_thermal: fix mode setting
thermal/int340x_thermal: Add additional UUIDs
thermal: cpu_cooling: Remove unused cur_freq variable
thermal: bcm2835: Fix crash in bcm2835_thermal_debugfs
thermal: samsung: Fix incorrect check after code merge
thermal/intel_powerclamp: fix __percpu declaration of worker_data
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlyWPM0ACgkQiiy9cAdy
T1FKCAwAsrZ2WlTdSp5/1Ogwa18vrS5dMHnMipOaZytG6HxBDXPGDzohNQxHbLQK
ShxQIrcPoVmB6WVrzpEYrPGamDETogp+ennVHwngTNDP7TN/U/oSVzBSJ/ZW32uO
w6LSXm3upVNluQLalLhy95xRUhZrt/FkCGp8BkTduR9VObfDtSHouCvsdMl0gXL1
qHZM1LguJsB0ziWNQpeXvFar63NO5bJEBtvP+sc+sGjuoRskE6Bz68GCg3t+Mbp+
73M/iVil8HWMHZELs1JwPslakbp1xDAcz7fjcizrVXcMaW8xqT68rylORC4aY3p3
bNqDTCPRjufE0ktjSi8ld8g9W3W/TbyizBVUDUHYqDzRomx5sx74vz57yCdqyKZX
/MENHMcrQ8OcB1OXKKTtgbh3dEcwSbrKHKuO+0XMz4We1S5Py8qRRTVxissqxWx6
aSeXYDC2XLlsPO9/OY6h08pQ21i7qO8wbuCbmpunaMgnk1vurF0B5Xt8s27FSoEK
M32p50xP
=Q90v
-----END PGP SIGNATURE-----
Merge tag '5.1-rc1-cifs-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb3 fixes from Steve French:
- two fixes for stable for guest mount problems with smb3.1.1
- two fixes for crediting (SMB3 flow control) on resent requests
- a byte range lock leak fix
- two fixes for incorrect rc mappings
* tag '5.1-rc1-cifs-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number
SMB3: Fix SMB3.1.1 guest mounts to Samba
cifs: Fix slab-out-of-bounds when tracing SMB tcon
cifs: allow guest mounts to work for smb3.11
fix incorrect error code mapping for OBJECTID_NOT_FOUND
cifs: fix that return -EINVAL when do dedupe operation
CIFS: Fix an issue with re-sending rdata when transport returning -EAGAIN
CIFS: Fix an issue with re-sending wdata when transport returning -EAGAIN
- Series to fix a memory leak in hd44780
while introducing charlcd_free().
From Andy Shevchenko
- Series to clean up the Kconfig menus and
a couple of improvements for charlcd.
From Mans Rullgard
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPjU5OPd5QIZ9jqqOGXyLc2htIW0FAlyWPZIACgkQGXyLc2ht
IW1Jkg/8DJPkAjGqmhSMXesHj/M4+0sIRq3MDLFoZTLBzWfalCCUeu/f8qun0Eb3
kGHRHuCfu1tJF6EP/k4UcPLvsmrh9iKRc/GYcgyHyuOxxkOhmyxd890FwdpZb5ev
Mawv0otK8Helv4W8hxQhO+BN1R2qPVhM7ddQ3dX2190N46Q3SzAvie3itdNzwF6C
voT9XfHXjqenSA/xjEZbTfF18zndlGE+inbouyw6vT/O4dIu++ljUVNLOahvUXPt
tEiZ3buwfIuF0tFnVllmSJToGlEJ47fdCZ0IP+eK0ay5DzhdDayimd/OdZPHCHL/
DY2Cvp5ugB6BwZLLSVpIGKNL/GAnA4686+zy3uRykCs1eDZsM2gSYe3Xat2aQCpy
101GUfgym+3VFtv+wkYk83Lxr2cCPOc5a700xzx9GSDtXAF+MY+Ahi8UoKnOa9lg
48MKgXMlCN69/Hd/LWQmc8qVAx3AQQxlWroKV8hK9N1ZRklY4J5ZhNkQjfUhO2FF
6U8+s6pq3s2Im3AKvsCoAxWZa+sBPJhH+yYWszxfqap+k8p4Ligku3iQgoM0OaMI
mxgv5cSZWeuO/tG7WTiJAuc0KgZen2w04iBYOYL6jyhMdzLjfs24CrJU9y0bLKFd
GY7jkxDuv6TsS5GqHalDepXyk1Y41C1jaddz1AcUvBqyVTRPjj0=
=MO2/
-----END PGP SIGNATURE-----
Merge tag 'auxdisplay-for-linus-v5.1-rc2' of git://github.com/ojeda/linux
Pull auxdisplay updates from Miguel Ojeda:
"A few fixes and improvements for auxdisplay:
- Series to fix a memory leak in hd44780 while introducing
charlcd_free(). From Andy Shevchenko
- Series to clean up the Kconfig menus and a couple of improvements
for charlcd. From Mans Rullgard"
* tag 'auxdisplay-for-linus-v5.1-rc2' of git://github.com/ojeda/linux:
auxdisplay: charlcd: make backlight initial state configurable
auxdisplay: charlcd: simplify init message display
auxdisplay: deconfuse configuration
auxdisplay: hd44780: Convert to use charlcd_free()
auxdisplay: panel: Convert to use charlcd_free()
auxdisplay: charlcd: Introduce charlcd_free() helper
auxdisplay: charlcd: Move to_priv() to charlcd namespace
auxdisplay: hd44780: Fix memory leak on ->remove()
Six fixes to four drivers and two core fixes. One core fix simply
corrects a missed destroy_rcu_head() but the other is hopefully the
end of an ongoing effort to make suspend/resume play nicely with scsi
quiesce.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXJZbZCYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishaIGAP9dOI8E
7RJRVPzXDEq7A2YE8whsWiQw7lB/BwMIYQenIwD/YySqYO8hTIDzXzkCuR94OY5L
HTWvluKkLILX/wAFM50=
=ItVl
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Six fixes to four drivers and two core fixes.
One core fix simply corrects a missed destroy_rcu_head() but the other
is hopefully the end of an ongoing effort to make suspend/resume play
nicely with scsi quiesce"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ibmvscsi: Fix empty event pool access during host removal
scsi: ibmvscsi: Protect ibmvscsi_head from concurrent modificaiton
scsi: hisi_sas: Add softreset in hisi_sas_I_T_nexus_reset()
scsi: qla2xxx: Fix NULL pointer crash due to stale CPUID
scsi: qla2xxx: Fix FC-AL connection target discovery
scsi: core: Avoid that a kernel warning appears during system resume
scsi: core: Also call destroy_rcu_head() for passthrough requests
scsi: iscsi: flush running unbind operations when removing a session
Since board support for the CLPS711X platform was removed,
remove the board support from the clps711x-timer driver.
Signed-off-by: Alexander Shiyan <shc_work@mail.ru>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lkml.kernel.org/r/20181220111626.17140-1-shc_work@mail.ru
Igor Russkikh says:
====================
net: aquantia: RX performance optimization patches
Here is a set of patches targeting for performance improvement
on various platforms and protocols.
Our main target was rx performance on iommu systems, notably
NVIDIA Jetson TX2 and NVIDIA Xavier platforms.
We introduce page reuse strategy to better deal with iommu dma mapping costs.
With it we see 80-90% of page reuse under some test configurations on UDP traffic.
This shows good improvements on other systems with IOMMU hardware, like
AMD Ryzen.
We've also improved TCP LRO configuration parameters, allowing packets to better
coalesce.
Page reuse tests were carried out using iperf3, iperf2, netperf and pktgen.
Mainly on UDP traffic, with various packet lengths.
Jetson TX2, UDP, Default MTU:
RX Lost Datagrams
Before: Max: 69% Min: 68% Avg: 68.5%
After: Max: 41% Min: 38% Avg: 39.2%
Maximum throughput
Before: 1.27 Gbits/sec
After: 2.41 Gbits/sec
AMD Ryzen 5 2400G, UDP, Default MTU:
RX Lost Datagrams
Before: Max: 12% Min: 4.5% Avg: 7.17%
After: Max: 6.2% Min: 2.3% Avg: 4.26%
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver is now constantly tested in our lab on aarch64 hardware:
Jetson tx2, Pascal and Xavier tegra based hardware.
Many of tegra smmu related HW bugs were fixed or workarounded already.
Thus, add ARM64 into Kconfig.
Add also COMPILE_TEST dependency.
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Default LRO HW configuration was very conservative.
Low Number of Descriptors per LRO Sequence, small session
timeout, inefficient settings in interrupt generation logic.
Change max number of LRO descriptors from 2 to 16 to
increase performance. Increase maximum coalescing interval
in HW to 250uS. Tune up HW LRO interrupt generation setting
to prevent hw issues with long LRO sessions.
Signed-off-by: Nikita Danilov <nikita.danilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For multigig rates 1K ring size is often not enough and causes extra
packet drops in hardware.
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This correlates with default internet MTU. This also allows page
flip/reuse to be activated, since each allocated RX page now serves for
two frags/packets.
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before that, we've refilled ring even on single descriptor move.
Under high packet load that caused page allocation logic to be triggered
too often. That made overall ring processing slower.
Moreover, with page buffer reuse implemented, we should give a chance
higher networking levels to process received packets faster, release
the pages they consumed and therefore give a higher chance for these
pages to be reused.
RX ring is now refilled only when AQ_CFG_RX_REFILL_THRES or more
descriptors were processed (32 by default). Under regular traffic this
gives quite enough time for packet to be consumed and page to be reused.
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We introduce internal aq_rxpage wrapper over regular page
where extra field is tracked: rxpage offset inside of allocated page.
This offset allows to reuse one page for multiple packets.
When needed (for example with large frames processing), allocated
pageorder could be customized. This gives even larger page reuse
efficiency.
page_ref_count is used to track page users. If during rx refill
underlying page has users, we increase pg_off by rx frame size
thus the top half of the page is reused.
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Atlantic driver used 14 bytes preallocated skb size. That made L3 protocol
processing inefficient because pskb_pull had to fetch all the L3/L4 headers
from extra fragments.
Specially on UDP flows that caused extra packet drops because CPU was
overloaded with pskb_pull.
This patch uses eth_get_headlen for skb preallocation.
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This series includes updates to mlx5 driver,
1) Compiler warnings cleanup from Saeed Mahameed
2) Parav Pandit simplifies sriov enable/disables
3) Gustavo A. R. Silva, Removes a redundant assignment
4) Moshe Shemesh, Adds Geneve tunnel stateless offload support
5) Eli Britstein, Adds the Support for VLAN modify action and
Replaces TC VLAN pop and push actions with VLAN modify
Note: This series includes two simple non-mlx5 patches,
1) Declare IANA_VXLAN_UDP_PORT definition in include/net/vxlan.h,
and use it in some drivers.
2) Declare GENEVE_UDP_PORT definition in include/net/geneve.h,
and use it in mlx5 and nfp drivers.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJclTLsAAoJEEg/ir3gV/o+7bEH/1sz4oKP2mfhKSbG/I/g7Q3D
ifnccYq2EyXd1HzeglXpzLndO8wPve9qr/ANKrrKIYYCxc8FpCdb4aJD1Ucuylbb
XHHdfbTIPMa3vjhKtR/Fydht4RkY5IBBsgXywBcNL3ofxmnleNt9JRSr76Yhr2sy
Q3H30X+UvwAAQJBY1X+P8RiJcSklLu0UPG2KtTXcCz8YRgOWK0JtEiQyQu6yET4u
zbVxYixwKgsR9uhwNXqLxVMsaWFue9cYmVSMLigDx7fRZvj6Ao9REEUflt1hCEoR
jOXm1Avnsg9TKnwmgiBjrWQQQ4h+IMfZLK8EtuxVcraBUjtQRVnPak5JjZMjDuc=
=7t4R
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2019-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2019-03-20
This series includes updates to mlx5 driver,
1) Compiler warnings cleanup from Saeed Mahameed
2) Parav Pandit simplifies sriov enable/disables
3) Gustavo A. R. Silva, Removes a redundant assignment
4) Moshe Shemesh, Adds Geneve tunnel stateless offload support
5) Eli Britstein, Adds the Support for VLAN modify action and
Replaces TC VLAN pop and push actions with VLAN modify
Note: This series includes two simple non-mlx5 patches,
1) Declare IANA_VXLAN_UDP_PORT definition in include/net/vxlan.h,
and use it in some drivers.
2) Declare GENEVE_UDP_PORT definition in include/net/geneve.h,
and use it in mlx5 and nfp drivers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jeff Kirsher says:
====================
100GbE Intel Wired LAN Driver Updates 2019-03-22
This series contains updates to ice driver only.
Akeem enables MAC anti-spoofing by default when a new VSI is being
created. Fixes an issue when reclaiming VF resources back to the pool
after reset, by freeing VF resources separately using the first VF
vector index to traverse the list, instead of starting at the last
assigned vectors list. Added support for VF & PF promiscuous mode in
the ice driver. Fixed the PF driver from letting the VF know it is "not
trusted" when it attempts to add more than its permitted additional MAC
addresses. Altered how the driver gets the VF VSIs instances, instead
of using the mailbox messages to retrieve VSIs, get it directly via the
VF object in the PF data structure.
Bruce fixes return values to resolve static analysis warnings. Made
whitespace changes to increase readability and reduce code wrapping.
Anirudh cleans up code by removing a function prototype that was never
implemented and removed an unused field in the ice_sched_vsi_info
structure.
Kiran fixes a potential divide by zero issue by adding a check.
Victor cleans up the transmit scheduler by adjusting the stack variable
usage and added/modified debug prints to make them more useful.
Yashaswini updates the driver in VEB mode to ensure that the LAN_EN bit
is set if all the right conditions are met.
Christopher ensures the loopback enable bit is not set for prune switch
rules, since all transmit traffic would be looped back to the internal
switch and dropped.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet says:
====================
tcp: add rx/tx cache to reduce lock contention
On hosts with many cpus we can observe a very serious contention
on spinlocks used in mm slab layer.
The following can happen quite often :
1) TX path
sendmsg() allocates one (fclone) skb on CPU A, sends a clone.
ACK is received on CPU B, and consumes the skb that was in the retransmit
queue.
2) RX path
network driver allocates skb on CPU C
recvmsg() happens on CPU D, freeing the skb after it has been delivered
to user space.
In both cases, we are hitting the asymetric alloc/free pattern
for which slab has to drain alien caches. At 8 Mpps per second,
this represents 16 Mpps alloc/free per second and has a huge penalty.
In an interesting experiment, I tried to use a single kmem_cache for all the skbs
(in skb_init() : skbuff_fclone_cache = skbuff_head_cache =
kmem_cache_create("skbuff_fclone_cache", sizeof(struct sk_buff_fclones),);
qnd most of the contention disappeared, since cpus could better use
their local slab per-cpu cache.
But we can do actually better, in the following patches.
TX : at ACK time, no longer free the skb but put it back in a tcp socket cache,
so that next sendmsg() can reuse it immediately.
RX : at recvmsg() time, do not free the skb but put it in a tcp socket cache
so that it can be freed by the cpu feeding the incoming packets in BH.
This increased the performance of small RPC benchmark by about 10 % on a host
with 112 hyperthreads.
v2 : - Solved a race condition : sk_stream_alloc_skb() to make sure the prior
clone has been freed.
- Really test rps_needed in sk_eat_skb() as claimed.
- Fixed rps_needed use in drivers/net/tun.c
v3: Added a #ifdef CONFIG_RPS, to avoid compile error (kbuild robot)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Often times, recvmsg() system calls and BH handling for a particular
TCP socket are done on different cpus.
This means the incoming skb had to be allocated on a cpu,
but freed on another.
This incurs a high spinlock contention in slab layer for small rpc,
but also a high number of cache line ping pongs for larger packets.
A full size GRO packet might use 45 page fragments, meaning
that up to 45 put_page() can be involved.
More over performing the __kfree_skb() in the recvmsg() context
adds a latency for user applications, and increase probability
of trapping them in backlog processing, since the BH handler
might found the socket owned by the user.
This patch, combined with the prior one increases the rpc
performance by about 10 % on servers with large number of cores.
(tcp_rr workload with 10,000 flows and 112 threads reach 9 Mpps
instead of 8 Mpps)
This also increases single bulk flow performance on 40Gbit+ links,
since in this case there are often two cpus working in tandem :
- CPU handling the NIC rx interrupts, feeding the receive queue,
and (after this patch) freeing the skbs that were consumed.
- CPU in recvmsg() system call, essentially 100 % busy copying out
data to user space.
Having at most one skb in a per-socket cache has very little risk
of memory exhaustion, and since it is protected by socket lock,
its management is essentially free.
Note that if rps/rfs is used, we do not enable this feature, because
there is high chance that the same cpu is handling both the recvmsg()
system call and the TCP rx path, but that another cpu did the skb
allocations in the device driver right before the RPS/RFS logic.
To properly handle this case, it seems we would need to record
on which cpu skb was allocated, and use a different channel
to give skbs back to this cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On hosts with a lot of cores, RPC workloads suffer from heavy contention on slab spinlocks.
20.69% [kernel] [k] queued_spin_lock_slowpath
5.64% [kernel] [k] _raw_spin_lock
3.83% [kernel] [k] syscall_return_via_sysret
3.48% [kernel] [k] __entry_text_start
1.76% [kernel] [k] __netif_receive_skb_core
1.64% [kernel] [k] __fget
For each sendmsg(), we allocate one skb, and free it at the time ACK packet comes.
In many cases, ACK packets are handled by another cpus, and this unfortunately
incurs heavy costs for slab layer.
This patch uses an extra pointer in socket structure, so that we try to reuse
the same skb and avoid these expensive costs.
We cache at most one skb per socket so this should be safe as far as
memory pressure is concerned.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We prefer static_branch_unlikely() over static_key_false() these days.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni says:
====================
net: dev: BYPASS for lockless qdisc
This patch series is aimed at improving xmit performances of lockless qdisc
in the uncontended scenario.
After the lockless refactor pfifo_fast can't leverage the BYPASS optimization.
Due to retpolines the overhead for the avoidables enqueue and dequeue operations
has increased and we see measurable regressions.
The first patch introduces the BYPASS code path for lockless qdisc, and the
second one optimizes such path further. Overall this avoids up to 3 indirect
calls per xmit packet. Detailed performance figures are reported in the 2nd
patch.
v2 -> v3:
- qdisc_is_empty() has a const argument (Eric)
v1 -> v2:
- use really an 'empty' flag instead of 'not_empty', as
suggested by Eric
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
With commit c5ad119fb6 ("net: sched: pfifo_fast use skb_array")
pfifo_fast no longer benefit from the TCQ_F_CAN_BYPASS optimization.
Due to retpolines the cost of the enqueue()/dequeue() pair has become
relevant and we observe measurable regression for the uncontended
scenario when the packet-rate is below line rate.
After commit 46b1c18f9d ("net: sched: put back q.qlen into a
single location") we can check for empty qdisc with a reasonably
fast operation even for nolock qdiscs.
This change extends TCQ_F_CAN_BYPASS support to nolock qdisc.
The new chunk of code mirrors closely the existing one for traditional
qdisc, leveraging a newly introduced helper to read atomically the
qdisc length.
Tested with pktgen in queue xmit mode, with pfifo_fast, a MQ
device, and MQ root qdisc:
threads vanilla patched
kpps kpps
1 2465 2889
2 4304 5188
4 7898 9589
Same as above, but with a single queue device:
threads vanilla patched
kpps kpps
1 2556 2827
2 2900 2900
4 5000 5000
8 4700 4700
No mesaurable changes in the contended scenarios, and more 10%
improvement in the uncontended ones.
v1 -> v2:
- rebased after flag name change
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The queue is marked not empty after acquiring the seqlock,
and it's up to the NOLOCK qdisc clearing such flag on dequeue.
Since the empty status lays on the same cache-line of the
seqlock, it's always hot on cache during the updates.
This makes the empty flag update a little bit loosy. Given
the lack of synchronization between enqueue and dequeue, this
is unavoidable.
v2 -> v3:
- qdisc_is_empty() has a const argument (Eric)
v1 -> v2:
- use really an 'empty' flag instead of 'not_empty', as
suggested by Eric
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add documentation to the tcp_ca_state enum, since this enum is
exposed in uapi.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Sowmini Varadhan <sowmini05@gmail.com>
Acked-by: Sowmini Varadhan <sowmini05@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>