In order to remove the 'struct timespec' definition and the
timespec64_to_timespec() helper function, change over the in-kernel
definition of 'struct scm_timestamping' to use the __kernel_old_timespec
replacement and open-code the assignment.
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
There are two 'struct timeval' fields in 'struct rusage'.
Unfortunately the definition of timeval is now ambiguous when used in
user space with a libc that has a 64-bit time_t, and this also changes
the 'rusage' definition in user space in a way that is incompatible with
the system call interface.
While there is no good solution to avoid all ambiguity here, change
the definition in the kernel headers to be compatible with the kernel
ABI, using __kernel_old_timeval as an unambiguous base type.
In previous discussions, there was also a plan to add a replacement
for rusage based on 64-bit timestamps and nanosecond resolution,
i.e. 'struct __kernel_timespec'. I have patches for that as well,
if anyone thinks we should do that.
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
This is mainly a patch for clarification, and to let us remove
the time_t definition from the kernel to prevent new users from
creeping in that might not be y2038-safe.
All remaining uses of 'time_t' or '__kernel_time_t' are part of
the user API that cannot be changed by that either have a
replacement or that do not suffer from the y2038 overflow.
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
There are two structures based on time_t that conflict between libc and
kernel: timeval and timespec. Both are now renamed to __kernel_old_timeval
and __kernel_old_timespec.
For time_t, the old typedef is still __kernel_time_t. There is nothing
wrong with that name, but it would be nice to not use that going forward
as this type is used almost only in deprecated interfaces because of
the y2038 overflow.
In the IPC headers (msgbuf.h, sembuf.h, shmbuf.h), __kernel_time_t is only
used for the 64-bit variants, which are not deprecated.
Change these to a plain 'long', which is the same type as __kernel_time_t
on all 64-bit architectures anyway, to reduce the number of users of the
old type.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The 'struct timespec' definition can no longer be part of the uapi headers
because it conflicts with a a now incompatible libc definition. Also,
we really want to remove it in order to prevent new uses from creeping in.
The same namespace conflict exists with time_t, which should also be
removed. __kernel_time_t could be used safely, but adding 'old' in the
name makes it clearer that this should not be used for new interfaces.
Add a replacement __kernel_old_timespec structure and __kernel_old_time_t
along the lines of __kernel_old_timeval.
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
When using the kernel datapath, the upcall don't
include skb hash info relatived. That will introduce
some problem, because the hash of skb is important
in kernel stack. For example, VXLAN module uses
it to select UDP src port. The tx queue selection
may also use the hash in stack.
Hash is computed in different ways. Hash is random
for a TCP socket, and hash may be computed in hardware,
or software stack. Recalculation hash is not easy.
Hash of TCP socket is computed:
tcp_v4_connect
-> sk_set_txhash (is random)
__tcp_transmit_skb
-> skb_set_hash_from_sk
There will be one upcall, without information of skb
hash, to ovs-vswitchd, for the first packet of a TCP
session. The rest packets will be processed in Open vSwitch
modules, hash kept. If this tcp session is forward to
VXLAN module, then the UDP src port of first tcp packet
is different from rest packets.
TCP packets may come from the host or dockers, to Open vSwitch.
To fix it, we store the hash info to upcall, and restore hash
when packets sent back.
+---------------+ +-------------------------+
| Docker/VMs | | ovs-vswitchd |
+----+----------+ +-+--------------------+--+
| ^ |
| | |
| | upcall v restore packet hash (not recalculate)
| +-+--------------------+--+
| tap netdev | | vxlan module
+---------------> +--> Open vSwitch ko +-->
or internal type | |
+-------------------------+
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-October/364062.html
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a statx attribute bit STATX_ATTR_VERITY which will be set if the
file has fs-verity enabled. This is the statx() equivalent of
FS_VERITY_FL which is returned by FS_IOC_GETFLAGS.
This is useful because it allows applications to check whether a file is
a verity file without opening it. Opening a verity file can be
expensive because the fsverity_info is set up on open, which involves
parsing metadata and optionally verifying a cryptographic signature.
This is analogous to how various other bits are exposed through both
FS_IOC_GETFLAGS and statx(), e.g. the encrypt bit.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
We have the i915 security fixes to backmerge, but first
let's clear the decks for other drivers to avoid a bigger
mess.
Signed-off-by: Dave Airlie <airlied@redhat.com>
Add support for flow steering counters action with a non-base counter
ID (offset) for bulk counters.
When creating a flow counter object, save the bulk value. This value is
used when a flow action with a non-base counter ID is requested - to
validate that the required offset is in the range of the allocated bulk.
Link: https://lore.kernel.org/r/20191103140723.77411-1-leon@kernel.org
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jozsef Kadlecsik says:
====================
ipset patches for nf-next
- Add wildcard support to hash:net,iface which makes possible to
match interface prefixes besides complete interfaces names, from
Kristian Evensen.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We can now get rid of the cmm_lock and completely rely on the balloon
compaction internals, which now also manage the page list and the
lock.
Inflated/"loaned" pages are now movable. Memory blocks that contain
such pages can get offlined. Also, all such pages will be marked
PageOffline() and can therefore be excluded in memory dumps using
recent versions of makedumpfile.
Don't switch to balloon_page_alloc() yet (due to the GFP_NOIO). Will
do that separately to discuss this change in detail.
Signed-off-by: David Hildenbrand <david@redhat.com>
[mpe: Add isolated_pages-- in cmm_migratepage() as suggested by David]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191031142933.10779-9-david@redhat.com
This patch adds the NFTA_FLOWTABLE_FLAGS attribute that allows users to
specify the NF_FLOWTABLE_HW_OFFLOAD flag. This patch also adds a new
setup interface for the flowtable type to perform the flowtable offload
block callback configuration.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the GPIOHANDLE_SET_CONFIG_IOCTL to the gpio chardev.
The ioctl allows some of the configuration of a requested handle to be
changed without having to release the line.
The primary use case is the changing of direction for bi-directional
lines.
Based on initial work by Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: Kent Gibson <warthog618@gmail.com>
Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Allow pull up/down bias to be disabled, allowing the line to float
or to be biased only by external circuitry.
Use case is for where the bias has been applied previously, either
by default or by the user, but that setting may conflict with the
current use of the line.
Signed-off-by: Kent Gibson <warthog618@gmail.com>
Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Add pull-up/pull-down flags to the gpio line get and
set ioctl() calls. Use cases include a push button
that does not have an external resistor.
Addition use cases described by Limor Fried (ladyada) of
Adafruit in this PR for Adafruit_Blinka Python lib:
https://github.com/adafruit/Adafruit_Blinka/pull/59
Signed-off-by: Drew Fustini <drew@pdp7.com>
[Kent: added BIAS to GPIO flag names and restrict application to input
lines]
Signed-off-by: Kent Gibson <warthog618@gmail.com>
Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
When setting the dump's time-stamp, use ktime_get_real in addition to
jiffies. This simplifies the user space implementation and bypasses
some inconsistent behavior with translating jiffies to current time.
The time taken is transformed into nsec, to comply with y2038 issue.
Fixes: c8e1da0bf9 ("devlink: Add health report functionality")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
One conflict in the BPF samples Makefile, some fixes in 'net' whilst
we were converting over to Makefile.target rules in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) BPF sample build fixes from Björn Töpel
2) Fix powerpc bpf tail call implementation, from Eric Dumazet.
3) DCCP leaks jiffies on the wire, fix also from Eric Dumazet.
4) Fix crash in ebtables when using dnat target, from Florian Westphal.
5) Fix port disable handling whne removing bcm_sf2 driver, from Florian
Fainelli.
6) Fix kTLS sk_msg trim on fallback to copy mode, from Jakub Kicinski.
7) Various KCSAN fixes all over the networking, from Eric Dumazet.
8) Memory leaks in mlx5 driver, from Alex Vesker.
9) SMC interface refcounting fix, from Ursula Braun.
10) TSO descriptor handling fixes in stmmac driver, from Jose Abreu.
11) Add a TX lock to synchonize the kTLS TX path properly with crypto
operations. From Jakub Kicinski.
12) Sock refcount during shutdown fix in vsock/virtio code, from Stefano
Garzarella.
13) Infinite loop in Intel ice driver, from Colin Ian King.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (108 commits)
ixgbe: need_wakeup flag might not be set for Tx
i40e: need_wakeup flag might not be set for Tx
igb/igc: use ktime accessors for skb->tstamp
i40e: Fix for ethtool -m issue on X722 NIC
iavf: initialize ITRN registers with correct values
ice: fix potential infinite loop because loop counter being too small
qede: fix NULL pointer deref in __qede_remove()
net: fix data-race in neigh_event_send()
vsock/virtio: fix sock refcnt holding during the shutdown
net: ethernet: octeon_mgmt: Account for second possible VLAN header
mac80211: fix station inactive_time shortly after boot
net/fq_impl: Switch to kvmalloc() for memory allocation
mac80211: fix ieee80211_txq_setup_flows() failure path
ipv4: Fix table id reference in fib_sync_down_addr
ipv6: fixes rt6_probe() and fib6_nh->last_probe init
net: hns: Fix the stray netpoll locks causing deadlock in NAPI path
net: usb: qmi_wwan: add support for DW5821e with eSIM support
CDC-NCM: handle incomplete transfer of MTU
nfc: netlink: fix double device reference drop
NFC: st21nfca: fix double free
...
Pull block fixes from Jens Axboe:
- Two NVMe device removal crash fixes, and a compat fixup for for an
ioctl that was introduced in this release (Anton, Charles, Max - via
Keith)
- Missing error path mutex unlock for drbd (Dan)
- cgroup writeback fixup on dead memcg (Tejun)
- blkcg online stats print fix (Tejun)
* tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block:
cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead
block: drbd: remove a stray unlock in __drbd_send_protocol()
blkcg: make blkcg_print_stat() print stats only for online blkgs
nvme: change nvme_passthru_cmd64 to explicitly mark rsvd
nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths
nvme-rdma: fix a segmentation fault during module unload
Section 7.2 of rfc7829: "Peer Address Thresholds (SCTP_PEER_ADDR_THLDS)
Socket Option" extends 'struct sctp_paddrthlds' with 'spt_pathcpthld'
added to allow a user to change ps_retrans per sock/asoc/transport, as
other 2 paddrthlds: pf_retrans, pathmaxrxt.
Note: to not break the user's program, here to support pf_retrans dump
and setting by adding a new sockopt SCTP_PEER_ADDR_THLDS_V2, and a new
structure sctp_paddrthlds_v2 instead of extending sctp_paddrthlds.
Also, when setting ps_retrans, the value is not allowed to be greater
than pf_retrans.
v1->v2:
- use SCTP_PEER_ADDR_THLDS_V2 to set/get pf_retrans instead,
as Marcelo and David Laight suggested.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a sockopt defined in section 7.3 of rfc7829: "Exposing
the Potentially Failed Path State", by which users can change
pf_expose per sock and asoc.
The new sockopt SCTP_EXPOSE_POTENTIALLY_FAILED_STATE is also
known as SCTP_EXPOSE_PF_STATE for short.
v2->v3:
- return -EINVAL if params.assoc_value > SCTP_PF_EXPOSE_MAX.
- define SCTP_EXPOSE_PF_STATE SCTP_EXPOSE_POTENTIALLY_FAILED_STATE.
v3->v4:
- improve changelog.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP Quick failover draft section 5.1, point 5 has been removed
from rfc7829. Instead, "the sender SHOULD (i) notify the Upper
Layer Protocol (ULP) about this state transition", as said in
section 3.2, point 8.
So this patch is to add SCTP_ADDR_POTENTIALLY_FAILED, defined
in section 7.1, "which is reported if the affected address
becomes PF". Also remove transport cwnd's update when moving
from PF back to ACTIVE , which is no longer in rfc7829 either.
Note that ulp_notify will be set to false if asoc->expose is
not 'enabled', according to last patch.
v2->v3:
- define SCTP_ADDR_PF SCTP_ADDR_POTENTIALLY_FAILED.
v3->v4:
- initialize spc_state with SCTP_ADDR_AVAILABLE, as Marcelo suggested.
- check asoc->pf_expose in sctp_assoc_control_transport(), as Marcelo
suggested.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As said in rfc7829, section 3, point 12:
The SCTP stack SHOULD expose the PF state of its destination
addresses to the ULP as well as provide the means to notify the
ULP of state transitions of its destination addresses from
active to PF, and vice versa. However, it is recommended that
an SCTP stack implementing SCTP-PF also allows for the ULP to be
kept ignorant of the PF state of its destinations and the
associated state transitions, thus allowing for retention of the
simpler state transition model of [RFC4960] in the ULP.
Not only does it allow to expose the PF state to ULP, but also
allow to ignore sctp-pf to ULP.
So this patch is to add pf_expose per netns, sock and asoc. And in
sctp_assoc_control_transport(), ulp_notify will be set to false if
asoc->expose is not 'enabled' in next patch.
It also allows a user to change pf_expose per netns by sysctl, and
pf_expose per sock and asoc will be initialized with it.
Note that pf_expose also works for SCTP_GET_PEER_ADDR_INFO sockopt,
to not allow a user to query the state of a sctp-pf peer address
when pf_expose is 'disabled', as said in section 7.3.
v1->v2:
- Fix a build warning noticed by Nathan Chancellor.
v2->v3:
- set pf_expose to UNUSED by default to keep compatible with old
applications.
v3->v4:
- add a new entry for pf_expose on ip-sysctl.txt, as Marcelo suggested.
- change this patch to 1/5, and move sctp_assoc_control_transport
change into 2/5, as Marcelo suggested.
- use SCTP_PF_EXPOSE_UNSET instead of SCTP_PF_EXPOSE_UNUSED, and
set SCTP_PF_EXPOSE_UNSET to 0 in enum, as Marcelo suggested.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit adds two netlink commands to TIPC in order for user to be
able to set or remove AEAD keys:
- TIPC_NL_KEY_SET
- TIPC_NL_KEY_FLUSH
When the 'KEY_SET' is given along with the key data, the key will be
initiated and attached to TIPC crypto. On the other hand, the
'KEY_FLUSH' command will remove all existing keys if any.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new structure 'tipc_aead_key' is added to the 'tipc.h' for user to
be able to transfer a key to TIPC in kernel. Netlink will be used for
this purpose in the later commits.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg says:
====================
Some relatively small changes:
* typo fixes in docs
* APIs for station separation using VLAN tags rather
than separate wifi netdevs
* some preparations for upcoming features (802.3 offload
and airtime queue limits (AQL)
* stack reduction in ieee80211_assoc_success()
* use DEFINE_DEBUGFS_ATTRIBUTE in hwsim
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This provides an alternative mechanism for AP VLAN support where a
single netdev is used with VLAN tagged frames instead of separate
netdevs for each VLAN without tagged frames from the WLAN driver.
By setting NL80211_EXT_FEATURE_VLAN_OFFLOAD flag the driver indicates
support for a single netdev with VLAN tagged frames. Separate
VLAN-specific netdevs can be added using RTM_NEWLINK/IFLA_VLAN_ID
similarly to Ethernet. NL80211_CMD_NEW_KEY (for group keys),
NL80211_CMD_NEW_STATION, and NL80211_CMD_SET_STATION will optionally
specify vlan_id using NL80211_ATTR_VLAN_ID.
Signed-off-by: Gurumoorthi Gnanasambandhan <gguru@codeaurora.org>
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20191031214640.5012-1-jouni@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Pull on for-linus to resolve what otherwise would have been a conflict
with the cgroups rstat patchset from Tejun.
* for-linus: (942 commits)
blkcg: make blkcg_print_stat() print stats only for online blkgs
nvme: change nvme_passthru_cmd64 to explicitly mark rsvd
nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths
nvme-rdma: fix a segmentation fault during module unload
iocost: don't nest spin_lock_irq in ioc_weight_write()
io_uring: ensure we clear io_kiocb->result before each issue
um-ubd: Entrust re-queue to the upper layers
nvme-multipath: remove unused groups_only mode in ana log
nvme-multipath: fix possible io hang after ctrl reconnect
io_uring: don't touch ctx in setup after ring fd install
io_uring: Fix leaked shadow_req
Linux 5.4-rc5
riscv: cleanup do_trap_break
nbd: verify socket is supported during setup
ata: libahci_platform: Fix regulator_get_optional() misuse
nbd: handle racing with error'ed out commands
nbd: protect cmd->status with cmd->lock
io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD
io_uring: used cached copies of sq->dropped and cq->overflow
ARM: dts: stm32: relax qspi pins slew-rate for stm32mp157
...
Currently we drop completion events, if the CQ ring is full. That's fine
for requests with bounded completion times, but it may make it harder or
impossible to use io_uring with networked IO where request completion
times are generally unbounded. Or with POLL, for example, which is also
unbounded.
After this patch, we never overflow the ring, we simply store requests
in a backlog for later flushing. This flushing is done automatically by
the kernel. To prevent the backlog from growing indefinitely, if the
backlog is non-empty, we apply back pressure on IO submissions. Any
attempt to submit new IO with a non-empty backlog will get an -EBUSY
return from the kernel. This is a signal to the application that it has
backlogged CQ events, and that it must reap those before being allowed
to submit more IO.
Note that if we do return -EBUSY, we will have filled whatever
backlogged events into the CQ ring first, if there's room. This means
the application can safely reap events WITHOUT entering the kernel and
waiting for them, they are already available in the CQ ring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While we have support for generic timeouts, we don't have a way to tie
a timeout to a specific SQE. The generic timeouts simply trigger wakeups
on the CQ ring.
This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid
as a link to a previous command. The timeout specific can be either
relative or absolute, following the same rules as IORING_OP_TIMEOUT. If
the timeout triggers before the dependent command completes, it will
attempt to cancel that command. Likewise, if the dependent command
completes before the timeout triggers, it will cancel the timeout.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Introduce three new ioctl commands BLKOPENZONE, BLKCLOSEZONE and
BLKFINISHZONE to allow applications to control the condition of zones
on a zoned block device through the execution of the REQ_OP_ZONE_OPEN,
REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH operations.
Contains contributions from Matias Bjorling, Hans Holmberg,
Dmitry Fomichev, Keith Busch, Damien Le Moal and Christoph Hellwig.
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Based on the code framework built on the last patch, to
support setting and dumping for vxlan, we only need to
add ip_tun_parse_opts_erspan() for .build_state and
ip_tun_fill_encap_opts_erspan() for .fill_encap and
if (tun_flags & TUNNEL_ERSPAN_OPT) for .get_encap_size.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on the code framework built on the last patch, to
support setting and dumping for vxlan, we only need to
add ip_tun_parse_opts_vxlan() for .build_state and
ip_tun_fill_encap_opts_vxlan() for .fill_encap and
if (tun_flags & TUNNEL_VXLAN_OPT) for .get_encap_size.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To add options setting and dumping, .build_state(), .fill_encap() and
.get_encap_size() in ip_tun_lwt_ops needs to be extended:
ip_tun_build_state():
ip_tun_parse_opts():
ip_tun_parse_opts_geneve()
ip_tun_fill_encap_info():
ip_tun_fill_encap_opts():
ip_tun_fill_encap_opts_geneve()
ip_tun_encap_nlsize()
ip_tun_opts_nlsize():
if (tun_flags & TUNNEL_GENEVE_OPT)
ip_tun_parse_opts(), ip_tun_fill_encap_opts() and ip_tun_opts_nlsize()
processes LWTUNNEL_IP_OPTS.
ip_tun_parse_opts_geneve(), ip_tun_fill_encap_opts_geneve() and
if (tun_flags & TUNNEL_GENEVE_OPT) processes LWTUNNEL_IP_OPTS_GENEVE.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Inline encryption hardware compliant with the UFS v2.1 standard or with
the upcoming version of the eMMC standard has the following properties:
(1) Per I/O request, the encryption key is specified by a previously
loaded keyslot. There might be only a small number of keyslots.
(2) Per I/O request, the starting IV is specified by a 64-bit "data unit
number" (DUN). IV bits 64-127 are assumed to be 0. The hardware
automatically increments the DUN for each "data unit" of
configurable size in the request, e.g. for each filesystem block.
Property (1) makes it inefficient to use the traditional fscrypt
per-file keys. Property (2) precludes the use of the existing
DIRECT_KEY fscrypt policy flag, which needs at least 192 IV bits.
Therefore, add a new fscrypt policy flag IV_INO_LBLK_64 which causes the
encryption to modified as follows:
- The encryption keys are derived from the master key, encryption mode
number, and filesystem UUID.
- The IVs are chosen as (inode_number << 32) | file_logical_block_num.
For filenames encryption, file_logical_block_num is 0.
Since the file nonces aren't used in the key derivation, many files may
share the same encryption key. This is much more efficient on the
target hardware. Including the inode number in the IVs and mixing the
filesystem UUID into the keys ensures that data in different files is
nevertheless still encrypted differently.
Additionally, limiting the inode and block numbers to 32 bits and
placing the block number in the low bits maintains compatibility with
the 64-bit DUN convention (property (2) above).
Since this scheme assumes that inode numbers are stable (which may
preclude filesystem shrinking) and that inode and file logical block
numbers are at most 32-bit, IV_INO_LBLK_64 will only be allowed on
filesystems that meet these constraints. These are acceptable
limitations for the cases where this format would actually be used.
Note that IV_INO_LBLK_64 is an on-disk format, not an implementation.
This patch just adds support for it using the existing filesystem layer
encryption. A later patch will add support for inline encryption.
Reviewed-by: Paul Crowley <paulcrowley@google.com>
Co-developed-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Now the kernel uses 64bit packet counters in scheduler layer,
we want to export these counters to user space.
Instead risking breaking user space by adding fields
to struct gnet_stats_basic, add a new TCA_STATS_PKT64.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gnet_stats_basic_packed was really meant to be private kernel structure.
If this proves to be a problem, we will have to rename the in-kernel
version.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changing nvme_passthru_cmd64 to add a field: rsvd2. This field is an explicit
marker for the padding space added on certain platforms as a result of the
enlargement of the result field from 32 bit to 64 bits in size, and
fixes differences in struct size when using compat ioctl for 32-bit
binaries on 64-bit architecture.
Fixes: 65e68edce0 ("nvme: allow 64-bit results in passthru commands")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Charles Machalow <csm10495@gmail.com>
[changelog]
Signed-off-by: Keith Busch <kbusch@kernel.org>
Pull clone3 stack argument update from Christian Brauner:
"This changes clone3() to do basic stack validation and to set up the
stack depending on whether or not it is growing up or down.
With clone3() the expectation is now very simply that the .stack
argument points to the lowest address of the stack and that
.stack_size specifies the initial stack size. This is diferent from
legacy clone() where the "stack" argument had to point to the lowest
or highest address of the stack depending on the architecture.
clone3() was released with 5.3. Currently, it is not documented and
very unclear to userspace how the stack and stack_size argument have
to be passed. After talking to glibc folks we concluded that changing
clone3() to determine stack direction and doing basic validation is
the right course of action.
Note, this is a potentially user visible change. In the very unlikely
case, that it breaks someone's use-case we will revert. (And then e.g.
place the new behavior under an appropriate flag.)
Note that passing an empty stack will continue working just as before.
Breaking someone's use-case is very unlikely. Neither glibc nor musl
currently expose a wrapper for clone3(). There is currently also no
real motivation for anyone to use clone3() directly. First, because
using clone{3}() with stacks requires some assembly (see glibc and
musl). Second, because it does not provide features that legacy
clone() doesn't. New features for clone3() will first happen in v5.5
which is why v5.4 is still a good time to try and make that change now
and backport it to v5.3.
I did a codesearch on https://codesearch.debian.net, github, and
gitlab and could not find any software currently relying directly on
clone3(). I expect this to change once we land CLONE_CLEAR_SIGHAND
which was a request coming from glibc at which point they'll likely
start using it"
* tag 'for-linus-2019-11-05' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
clone3: validate stack arguments
Validate the stack arguments and setup the stack depening on whether or not
it is growing down or up.
Legacy clone() required userspace to know in which direction the stack is
growing and pass down the stack pointer appropriately. To make things more
confusing microblaze uses a variant of the clone() syscall selected by
CONFIG_CLONE_BACKWARDS3 that takes an additional stack_size argument.
IA64 has a separate clone2() syscall which also takes an additional
stack_size argument. Finally, parisc has a stack that is growing upwards.
Userspace therefore has a lot nasty code like the following:
#define __STACK_SIZE (8 * 1024 * 1024)
pid_t sys_clone(int (*fn)(void *), void *arg, int flags, int *pidfd)
{
pid_t ret;
void *stack;
stack = malloc(__STACK_SIZE);
if (!stack)
return -ENOMEM;
#ifdef __ia64__
ret = __clone2(fn, stack, __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
#elif defined(__parisc__) /* stack grows up */
ret = clone(fn, stack, flags | SIGCHLD, arg, pidfd);
#else
ret = clone(fn, stack + __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
#endif
return ret;
}
or even crazier variants such as [3].
With clone3() we have the ability to validate the stack. We can check that
when stack_size is passed, the stack pointer is valid and the other way
around. We can also check that the memory area userspace gave us is fine to
use via access_ok(). Furthermore, we probably should not require
userspace to know in which direction the stack is growing. It is easy
for us to do this in the kernel and I couldn't find the original
reasoning behind exposing this detail to userspace.
/* Intentional user visible API change */
clone3() was released with 5.3. Currently, it is not documented and very
unclear to userspace how the stack and stack_size argument have to be
passed. After talking to glibc folks we concluded that trying to change
clone3() to setup the stack instead of requiring userspace to do this is
the right course of action.
Note, that this is an explicit change in user visible behavior we introduce
with this patch. If it breaks someone's use-case we will revert! (And then
e.g. place the new behavior under an appropriate flag.)
Breaking someone's use-case is very unlikely though. First, neither glibc
nor musl currently expose a wrapper for clone3(). Second, there is no real
motivation for anyone to use clone3() directly since it does not provide
features that legacy clone doesn't. New features for clone3() will first
happen in v5.5 which is why v5.4 is still a good time to try and make that
change now and backport it to v5.3. Searches on [4] did not reveal any
packages calling clone3().
[1]: https://lore.kernel.org/r/CAG48ez3q=BeNcuVTKBN79kJui4vC6nw0Bfq6xc-i0neheT17TA@mail.gmail.com
[2]: https://lore.kernel.org/r/20191028172143.4vnnjpdljfnexaq5@wittgenstein
[3]: 5238e95759/src/basic/raw-clone.h (L31)
[4]: https://codesearch.debian.net
Fixes: 7f192e3cd3 ("fork: add clone3")
Cc: Kees Cook <keescook@chromium.org>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-api@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: <stable@vger.kernel.org> # 5.3
Cc: GNU C Library <libc-alpha@sourceware.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20191031113608.20713-1-christian.brauner@ubuntu.com
The net,iface equal functions currently compares the full interface
names. In several cases, wildcard (or prefix) matching is useful. For
example, when converting a large iptables rule-set to make use of ipset,
I was able to significantly reduce the number of set elements by making
use of wildcard matching.
Wildcard matching is enabled by adding "wildcard" when adding an element
to a set. Internally, this causes the IPSET_FLAG_IFACE_WILDCARD-flag to
be set. When this flag is set, only the initial part of the interface
name is used for comparison.
Wildcard matching is done per element and not per set, as there are many
cases where mixing wildcard and non-wildcard elements are useful. This
means that is up to the user to handle (avoid) overlapping interface
names.
Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
This feature gives the user RW access to any opal table with admin1
authority. The flags described in the new structure determines if the user
wants to read/write the data. Flags are checked for valid values in
order to allow future features to be added to the ioctl.
The user can provide the desired table's UID. Also, the ioctl provides a
size and offset field and internally will loop data accesses to return
the full data block. Read overrun is prevented by the initiator's
sec_send_recv() backend. The ioctl provides a private field with the
intention to accommodate any future expansions to the ioctl.
Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>