Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree, most relevantly they are:
1) Extend nft_exthdr to allow to match TCP options bitfields, from
Manuel Messner.
2) Allow to check if IPv6 extension header is present in nf_tables,
from Phil Sutter.
3) Allow to set and match conntrack zone in nf_tables, patches from
Florian Westphal.
4) Several patches for the nf_tables set infrastructure, this includes
cleanup and preparatory patches to add the new bitmap set type.
5) Add optional ruleset generation ID check to nf_tables and allow to
delete rules that got no public handle yet via NFTA_RULE_ID. These
patches add the missing kernel infrastructure to support rule
deletion by description from userspace.
6) Missing NFT_SET_OBJECT flag to select the right backend when sets
stores an object map.
7) A couple of cleanups for the expectation and SIP helper, from Gao
feng.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
to the given cgroup the descendent cgroup will be able to override
effective bpf program that was inherited from this cgroup.
By default it's not passed, therefore override is disallowed.
Examples:
1.
prog X attached to /A with default
prog Y fails to attach to /A/B and /A/B/C
Everything under /A runs prog X
2.
prog X attached to /A with allow_override.
prog Y fails to attach to /A/B with default (non-override)
prog M attached to /A/B with allow_override.
Everything under /A/B runs prog M only.
3.
prog X attached to /A with allow_override.
prog Y fails to attach to /A with default.
The user has to detach first to switch the mode.
In the future this behavior may be extended with a chain of
non-overridable programs.
Also fix the bug where detach from cgroup where nothing is attached
was not throwing error. Return ENOENT in such case.
Add several testcases and adjust libbpf.
Fixes: 3007098494 ("cgroup: add support for eBPF programs")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Daniel Mack <daniel@zonque.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This new attribute allows us to uniquely identify a rule in transaction.
Robots may trigger an insertion followed by deletion in a batch, in that
scenario we still don't have a public rule handle that we can use to
delete the rule. This is similar to the NFTA_SET_ID attribute that
allows us to refer to an anonymous set from a batch.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch allows userspace to specify the generation ID that has been
used to build an incremental batch update.
If userspace specifies the generation ID in the batch message as
attribute, then nfnetlink compares it to the current generation ID so
you make sure that you work against the right baseline. Otherwise, bail
out with ERESTART so userspace knows that its changeset is stale and
needs to respin. Userspace can do this transparently at the cost of
taking slightly more time to refresh caches and rework the changeset.
This check is optional, if there is no NFNL_BATCH_GENID attribute in the
batch begin message, then no check is performed.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull rdma fixes from Doug Ledford:
"Third round of -rc fixes for 4.10 kernel:
- two security related issues in the rxe driver
- one compile issue in the RDMA uapi header"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
RDMA: Don't reference kernel private header from UAPI header
IB/rxe: Fix mem_check_range integer overflow
IB/rxe: Fix resid update
Per PCIe r3.1, sec 6.2.10 and sec 7.13.4, on Root Ports that support "RP
Extensions for DPC",
When the DPC Trigger Status bit is Set and the DPC RP Busy bit is Set,
software must leave the Root Port in DPC until the DPC RP Busy bit reads
0b.
Wait up to 1 second for the Root Port to become non-busy.
[bhelgaas: changelog, spec references]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
The eswitch_[gs]et command is supposed to be similar to port_[gs]et
command - for multiple eswitch attributes. However, when it was introduced
by 08f4b5918b ("net/devlink: Add E-Switch mode control") it was wrongly
named with the word "mode" in it. So fix this now, make the oririnal
enum value existing but obsolete.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg says:
====================
Some more updates:
* use shash in mac80211 crypto code where applicable
* some documentation fixes
* pass RSSI levels up in change notifications
* remove unused rfkill-regulator
* various other cleanups
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This command could be useful to inc/dec fields.
For example, to forward any TCP packet and decrease its TTL:
$ tc filter add dev enp0s9 protocol ip parent ffff: \
flower ip_proto tcp \
action pedit munge ip ttl add 0xff pipe \
action mirred egress redirect dev veth0
In the example above, adding 0xff to this u8 field is actually
decreasing it by one, since the operation is masked.
Signed-off-by: Amir Vadai <amir@vadai.me>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend pedit to enable the user setting offset relative to network
headers. This change would enable to work with more complex header
schemes (vs the simple IPv4 case) where setting a fixed offset relative
to the network header is not enough.
After this patch, the action has information about the exact header type
and field inside this header. This information could be used later on
for hardware offloading of pedit.
Backward compatibility was being kept:
1. Old kernel <-> new userspace
2. New kernel <-> old userspace
3. add rule using new userspace <-> dump using old userspace
4. add rule using old userspace <-> dump using new userspace
When using the extended api, new netlink attributes are being used. This
way, operation will fail in (1) and (3) - and no malformed rule be added
or dumped. Of course, new user space that doesn't need the new
functionality can use the old netlink attributes and operation will
succeed.
Since action can support both api's, (2) should work, and it is easy to
write the new user space to have (4) work.
The action is having a strict check that only header types and commands
it can handle are accepted. This way future additions will be much
easier.
Usage example:
$ tc filter add dev enp0s9 protocol ip parent ffff: \
flower \
ip_proto tcp \
dst_port 80 \
action pedit munge tcp dport set 8080 pipe \
action mirred egress redirect dev veth0
Will forward tcp port whose original dest port is 80, while modifying
the destination port to 8080.
Signed-off-by: Amir Vadai <amir@vadai.me>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously all data passed over binder needed
to be serialized, with the exception of Binder
objects and file descriptors.
This patchs adds support for scatter-gathering raw
memory buffers into a binder transaction, avoiding
the need to first serialize them into a Parcel.
To remain backwards compatibile with existing
binder clients, it introduces two new command
ioctls for this purpose - BC_TRANSACTION_SG and
BC_REPLY_SG. These commands may only be used with
the new binder_transaction_data_sg structure,
which adds a field for the total size of the
buffers we are scatter-gathering.
Because memory buffers may contain pointers to
other buffers, we allow callers to specify
a parent buffer and an offset into it, to indicate
this is a location pointing to the buffer that
we are fixing up. The kernel will then take care
of fixing up the pointer to that buffer as well.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Martijn Coenen <maco@google.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Amit Pundir <amit.pundir@linaro.org>
Cc: Serban Constantinescu <serban.constantinescu@arm.com>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Cc: Rom Lemarchand <romlem@google.com>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Martijn Coenen <maco@google.com>
[jstultz: Fold in small fix from Amit Pundir <amit.pundir@linaro.org>]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Stateful network admission policy may allow connections to one
direction and reject connections initiated in the other direction.
After policy change it is possible that for a new connection an
overlapping conntrack entry already exists, where the original
direction of the existing connection is opposed to the new
connection's initial packet.
Most importantly, conntrack state relating to the current packet gets
the "reply" designation based on whether the original direction tuple
or the reply direction tuple matched. If this "directionality" is
wrong w.r.t. to the stateful network admission policy it may happen
that packets in neither direction are correctly admitted.
This patch adds a new "force commit" option to the OVS conntrack
action that checks the original direction of an existing conntrack
entry. If that direction is opposed to the current packet, the
existing conntrack entry is deleted and a new one is subsequently
created in the correct direction.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the fields of the conntrack original direction 5-tuple to struct
sw_flow_key. The new fields are initially marked as non-existent, and
are populated whenever a conntrack action is executed and either finds
or generates a conntrack entry. This means that these fields exist
for all packets that were not rejected by conntrack as untrackable.
The original tuple fields in the sw_flow_key are filled from the
original direction tuple of the conntrack entry relating to the
current packet, or from the original direction tuple of the master
conntrack entry, if the current conntrack entry has a master.
Generally, expected connections of connections having an assigned
helper (e.g., FTP), have a master conntrack entry.
The main purpose of the new conntrack original tuple fields is to
allow matching on them for policy decision purposes, with the premise
that the admissibility of tracked connections reply packets (as well
as original direction packets), and both direction packets of any
related connections may be based on ACL rules applying to the master
connection's original direction 5-tuple. This also makes it easier to
make policy decisions when the actual packet headers might have been
transformed by NAT, as the original direction 5-tuple represents the
packet headers before any such transformation.
When using the original direction 5-tuple the admissibility of return
and/or related packets need not be based on the mere existence of a
conntrack entry, allowing separation of admission policy from the
established conntrack state. While existence of a conntrack entry is
required for admission of the return or related packets, policy
changes can render connections that were initially admitted to be
rejected or dropped afterwards. If the admission of the return and
related packets was based on mere conntrack state (e.g., connection
being in an established state), a policy change that would make the
connection rejected or dropped would need to find and delete all
conntrack entries affected by such a change. When using the original
direction 5-tuple matching the affected conntrack entries can be
allowed to time out instead, as the established state of the
connection would not need to be the basis for packet admission any
more.
It should be noted that the directionality of related connections may
be the same or different than that of the master connection, and
neither the original direction 5-tuple nor the conntrack state bits
carry this information. If needed, the directionality of the master
connection can be stored in master's conntrack mark or labels, which
are automatically inherited by the expected related connections.
The fact that neither ARP nor ND packets are trackable by conntrack
allows mutual exclusion between ARP/ND and the new conntrack original
tuple fields. Hence, the IP addresses are overlaid in union with ARP
and ND fields. This allows the sw_flow_key to not grow much due to
this patch, but it also means that we must be careful to never use the
new key fields with ARP or ND packets. ARP is easy to distinguish and
keep mutually exclusive based on the ethernet type, but ND being an
ICMPv6 protocol requires a bit more attention.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the array of labels in struct ovs_key_ct_label an union, adding a
u32 array of the same byte size as the existing u8 array. It is
faster to loop through the labels 32 bits at the time, which is also
the alignment of netlink attributes.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to implement Sender-Side Procedures for the Add
Outgoing and Incoming Streams Request Parameter described in
rfc6525 section 5.1.5-5.1.6.
It is also to add sockopt SCTP_ADD_STREAMS in rfc6525 section
6.3.4 for users.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to implement Sender-Side Procedures for the SSN/TSN
Reset Request Parameter descibed in rfc6525 section 5.1.4.
It is also to add sockopt SCTP_RESET_ASSOC in rfc6525 section 6.3.3
for users.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tracksticks on the Lenovo thinkpads have their buttons connected
through the touchpad device. We already fixed that in synaptics.c, but
when we switch the device into RMI4 mode to have proper support, the
pass-through functionality can't deal with them easily.
We add a new PS/2 flag and protocol designed for psmouse. The RMI4 F03
pass-through can then emit a special set of commands to notify psmouse the
state of the buttons.
This patch implements the protocol in psmouse, while an other will
do the same for rmi4-f03.
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
The nl80211_nan_dual_band_conf enumeration doesn't make much sense.
The default value is assigned to a bit, which makes it weird if the
default bit and other bits are set at the same time.
To improve this, get rid of NL80211_NAN_BAND_DEFAULT and add a wiphy
configuration to let the drivers define which bands are supported.
This is exposed to the userspace, which then can make a decision on
which band(s) to use. Additionally, rename all "dual_band" elements
to "bands", to make things clearer.
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Remove references to private kernel header and defines from exported
ib_user_verb.h file.
The code snippet below is used to reproduce the issue:
#include <stdio.h>
#include <rdma/ib_user_verb.h>
int main(void)
{
printf("IB_USER_VERBS_ABI_VERSION = %d\n", IB_USER_VERBS_ABI_VERSION);
return 0;
}
It fails during compilation phase with an error:
➜ /tmp gcc main.c
main.c:2:31: fatal error: rdma/ib_user_verb.h: No such file or directory
#include <rdma/ib_user_verb.h>
^
compilation terminated.
Fixes: 189aba99e7 ("IB/uverbs: Extend modify_qp and support packet pacing")
CC: Bodong Wang <bodong@mellanox.com>
CC: Matan Barak <matanb@mellanox.com>
CC: Christoph Hellwig <hch@infradead.org>
Tested-by: Slava Shwartsman <slavash@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch implements the kernel side of the TCP option patch.
Signed-off-by: Manuel Messner <mm@skelett.io>
Reviewed-by: Florian Westphal <fw@strlen.de>
Acked-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Just like with counters the direction attribute is optional.
We set priv->dir to MAX unconditionally to avoid duplicating the assignment
for all keys with optional direction.
For keys where direction is mandatory, existing code already returns
an error.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If NFT_EXTHDR_F_PRESENT is set, exthdr will not copy any header field
data into *dest, but instead set it to 1 if the header is found and 0
otherwise.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Update the drivers to pass the RSSI level as a cfg80211_cqm_rssi_notify
parameter and pass this value to userspace in a new nl80211 attribute.
This helps both userspace and also helps in the implementation of the
multiple RSSI thresholds CQM mechanism.
Note for marvell/mwifiex I pass 0 for the RSSI value because the new
RSSI value is not available to the driver at the time of the
cfg80211_cqm_rssi_notify call, but the driver queries the new value
immediately after that, so it is actually available just a moment later
if we wanted to defer caling cfg80211_cqm_rssi_notify until that moment.
Without this, the new cfg80211 code (patch 3) will call .get_station
which will send a duplicate HostCmd_CMD_RSSI_INFO command to the hardware.
Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The conflict was an interaction between a bug fix in the
netvsc driver in 'net' and an optimization of the RX path
in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
The big feature this time is support for POWER9 using the radix-tree
MMU for host and guest. This required some changes to arch/powerpc
code, so I talked with Michael Ellerman and he created a topic branch
with this patchset, which I merged into kvm-ppc-next and which Michael
will pull into his tree. Michael also put in some patches from Nick
Piggin which fix bugs in the interrupt vector code in relocatable
kernels when coming from a KVM guest.
Other notable changes include:
* Add the ability to change the size of the hashed page table,
from David Gibson.
* XICS (interrupt controller) emulation fixes and improvements,
from Li Zhong.
* Bug fixes from myself and Thomas Huth.
These patches define some new KVM capabilities and ioctls, but there
should be no conflicts with anything else currently upstream, as far
as I am aware.
Add a hypercall to retrieve the host realtime clock and the TSC value
used to calculate that clock read.
Used to implement clock synchronization between host and guest.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch is a quick fixup of the user structures that will prevent
the structures from being different sizes on 32 and 64 bit archs.
Taking this fix will allow us to *NOT* have to do compat ioctls for
the sed code.
Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Fixes: 19641f2d76 ("Include: Uapi: Add user ABI for Sed/Opal")
Signed-off-by: Jens Axboe <axboe@fb.com>
The libnetfilter_conntrack userland library always sets IPS_CONFIRMED
when building a CTA_STATUS attribute. If this toggles the bit from
0->1, the parser will return an error. On Linux 4.4+ this will cause any
NFQA_EXP attribute in the packet to be ignored. This breaks conntrackd's
userland helpers because they operate on unconfirmed connections.
Instead of returning -EBUSY if the user program asks to modify an
unchangeable bit, simply ignore the change.
Also, fix the logic so that user programs are allowed to clear
the bits that they are allowed to change.
Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This resolves the merge errors that were reported in linux-next and it
picks up the staging and IIO fixes that we need/want in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree, they are:
1) Stash ctinfo 3-bit field into pointer to nf_conntrack object from
sk_buff so we only access one single cacheline in the conntrack
hotpath. Patchset from Florian Westphal.
2) Don't leak pointer to internal structures when exporting x_tables
ruleset back to userspace, from Willem DeBruijn. This includes new
helper functions to copy data to userspace such as xt_data_to_user()
as well as conversions of our ip_tables, ip6_tables and arp_tables
clients to use it. Not surprinsingly, ebtables requires an ad-hoc
update. There is also a new field in x_tables extensions to indicate
the amount of bytes that we copy to userspace.
3) Add nf_log_all_netns sysctl: This new knob allows you to enable
logging via nf_log infrastructure for all existing netnamespaces.
Given the effort to provide pernet syslog has been discontinued,
let's provide a way to restore logging using netfilter kernel logging
facilities in trusted environments. Patch from Michal Kubecek.
4) Validate SCTP checksum from conntrack helper, from Davide Caratti.
5) Merge UDPlite conntrack and NAT helpers into UDP, this was mostly
a copy&paste from the original helper, from Florian Westphal.
6) Reset netfilter state when duplicating packets, also from Florian.
7) Remove unnecessary check for broadcast in IPv6 in pkttype match and
nft_meta, from Liping Zhang.
8) Add missing code to deal with loopback packets from nft_meta when
used by the netdev family, also from Liping.
9) Several cleanups on nf_tables, one to remove unnecessary check from
the netlink control plane path to add table, set and stateful objects
and code consolidation when unregister chain hooks, from Gao Feng.
10) Fix harmless reference counter underflow in IPVS that, however,
results in problems with the introduction of the new refcount_t
type, from David Windsor.
11) Enable LIBCRC32C from nf_ct_sctp instead of nf_nat_sctp,
from Davide Caratti.
12) Missing documentation on nf_tables uapi header, from Liping Zhang.
13) Use rb_entry() helper in xt_connlimit, from Geliang Tang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
New nested netlink attribute to associate tunnel info per vlan.
This is used by bridge driver to send tunnel metadata to
bridge ports in vlan tunnel mode. This patch also adds new per
port flag IFLA_BRPORT_VLAN_TUNNEL to enable vlan tunnel mode.
off by default.
One example use for this is a vxlan bridging gateway or vtep
which maps vlans to vn-segments (or vnis). User can configure
per-vlan tunnel information which the bridge driver can use
to bridge vlan into the corresponding vn-segment.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vxlan COLLECT_METADATA mode today solves the per-vni netdev
scalability problem in l3 networks. It expects all forwarding
information to be present in dst_metadata. This patch series
enhances collect metadata mode to include the case where only
vni is present in dst_metadata, and the vxlan driver can then use
the rest of the forwarding information datbase to make forwarding
decisions. There is no change to default COLLECT_METADATA
behaviour. These changes only apply to COLLECT_METADATA when
used with the bridging use-case with a special dst_metadata
tunnel info flag (eg: where vxlan device is part of a bridge).
For all this to work, the vxlan driver will need to now support a
single fdb table hashed by mac + vni. This series essentially makes
this happen.
use-case and workflow:
vxlan collect metadata device participates in bridging vlan
to vn-segments. Bridge driver above the vxlan device,
sends the vni corresponding to the vlan in the dst_metadata.
vxlan driver will lookup forwarding database with (mac + vni)
for the required remote destination information to forward the
packet.
Changes introduced by this patch:
- allow learning and forwarding database state in vxlan netdev in
COLLECT_METADATA mode. Current behaviour is not changed
by default. tunnel info flag IP_TUNNEL_INFO_BRIDGE is used
to support the new bridge friendly mode.
- A single fdb table hashed by (mac, vni) to allow fdb entries with
multiple vnis in the same fdb table
- rx path already has the vni
- tx path expects a vni in the packet with dst_metadata
- prior to this series, fdb remote_dsts carried remote vni and
the vxlan device carrying the fdb table represented the
source vni. With the vxlan device now representing multiple vnis,
this patch adds a src vni attribute to the fdb entry. The remote
vni already uses NDA_VNI attribute. This patch introduces
NDA_SRC_VNI netlink attribute to represent the src vni in a multi
vni fdb table.
iproute2 example (patched and pruned iproute2 output to just show
relevant fdb entries):
example shows same host mac learnt on two vni's.
before (netdev per vni):
$bridge fdb show | grep "00:02:00:00:00:03"
00:02:00:00:00:03 dev vxlan1001 dst 12.0.0.8 self
00:02:00:00:00:03 dev vxlan1000 dst 12.0.0.8 self
after this patch with collect metadata in bridged mode (single netdev):
$bridge fdb show | grep "00:02:00:00:00:03"
00:02:00:00:00:03 dev vxlan0 src_vni 1001 dst 12.0.0.8 self
00:02:00:00:00:03 dev vxlan0 src_vni 1000 dst 12.0.0.8 self
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the encode/decode functionality from the ife module instead of using
implementation inside the act_ife.
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This module is responsible for the ife encapsulation protocol
encode/decode logics. That module can:
- ife_encode: encode skb and reserve space for the ife meta header
- ife_decode: decode skb and extract the meta header size
- ife_tlv_meta_encode - encodes one tlv entry into the reserved ife
header space.
- ife_tlv_meta_decode - decodes one tlv entry from the packet
- ife_tlv_meta_next - advance to the next tlv
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the latest version of the IPv6 Segment Routing IETF draft [1] the
cleanup flag is removed and the flags field length is shrunk from 16 bits
to 8 bits. As a consequence, the input of the HMAC computation is modified
in a non-backward compatible way by covering the whole octet of flags
instead of only the cleanup bit. As such, if an implementation compatible
with the latest draft computes the HMAC of an SRH who has other flags set
to 1, then the HMAC result would differ from the current implementation.
This patch carries those modifications to prevent conflict with other
implementations of IPv6 SR.
[1] https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-05
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Debugging issues caused by pfmemalloc is often tedious.
Add a new SNMP counter to more easily diagnose these problems.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Josef Bacik <jbacik@fb.com>
Acked-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This ioctl opens a file to which a socket is bound and
returns a file descriptor. The caller has to have CAP_NET_ADMIN
in the socket network namespace.
Currently it is impossible to get a path and a mount point
for a socket file. socket_diag reports address, device ID and inode
number for unix sockets. An address can contain a relative path or
a file may be moved somewhere. And these properties say nothing about
a mount namespace and a mount point of a socket file.
With the introduced ioctl, we can get a path by reading
/proc/self/fd/X and get mnt_id from /proc/self/fdinfo/X.
In CRIU we are going to use this ioctl to dump and restore unix socket.
Here is an example how it can be used:
$ strace -e socket,bind,ioctl ./test /tmp/test_sock
socket(AF_UNIX, SOCK_STREAM, 0) = 3
bind(3, {sa_family=AF_UNIX, sun_path="test_sock"}, 11) = 0
ioctl(3, SIOCUNIXFILE, 0) = 4
^Z
$ ss -a | grep test_sock
u_str LISTEN 0 1 test_sock 17798 * 0
$ ls -l /proc/760/fd/{3,4}
lrwx------ 1 root root 64 Feb 1 09:41 3 -> 'socket:[17798]'
l--------- 1 root root 64 Feb 1 09:41 4 -> /tmp/test_sock
$ cat /proc/760/fdinfo/4
pos: 0
flags: 012000000
mnt_id: 40
$ cat /proc/self/mountinfo | grep "^40\s"
40 19 0:37 / /tmp rw shared:23 - tmpfs tmpfs rw
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Kerrisk <<mtk.manpages@gmail.com> writes:
I would like to write code that discovers the namespace setup on a live
system. The NS_GET_PARENT and NS_GET_USERNS ioctl() operations added in
Linux 4.9 provide much of what I want, but there are still a couple of
small pieces missing. Those pieces are added with this patch series.
Here's an example program that makes use of the new ioctl() operations.
8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---
/* ns_capable.c
(C) 2016 Michael Kerrisk, <mtk.manpages@gmail.com>
Licensed under the GNU General Public License v2 or later.
Test whether a process (identified by PID) might (subject to LSM checks)
have capabilities in a namespace (identified by a /proc/PID/ns/xxx file).
*/
} while (0)
exit(EXIT_FAILURE); } while (0)
/* Display capabilities sets of process with specified PID */
static void
show_cap(pid_t pid)
{
cap_t caps;
char *cap_string;
caps = cap_get_pid(pid);
if (caps == NULL)
errExit("cap_get_proc");
cap_string = cap_to_text(caps, NULL);
if (cap_string == NULL)
errExit("cap_to_text");
printf("Capabilities: %s\n", cap_string);
}
/* Obtain the effective UID pf the process 'pid' by
scanning its /proc/PID/file */
static uid_t
get_euid_of_process(pid_t pid)
{
char path[PATH_MAX];
char line[1024];
int uid;
snprintf(path, sizeof(path), "/proc/%ld/status", (long) pid);
FILE *fp;
fp = fopen(path, "r");
if (fp == NULL)
errExit("fopen-/proc/PID/status");
for (;;) {
if (fgets(line, sizeof(line), fp) == NULL) {
/* Should never happen... */
fprintf(stderr, "Failure scanning %s\n", path);
exit(EXIT_FAILURE);
}
if (strstr(line, "Uid:") == line) {
sscanf(line, "Uid: %*d %d %*d %*d", &uid);
return uid;
}
}
}
int
main(int argc, char *argv[])
{
int ns_fd, userns_fd, pid_userns_fd;
int nstype;
int next_fd;
struct stat pid_stat;
struct stat target_stat;
char *pid_str;
pid_t pid;
char path[PATH_MAX];
if (argc < 2) {
fprintf(stderr, "Usage: %s PID [ns-file]\n", argv[0]);
fprintf(stderr, "\t'ns-file' is a /proc/PID/ns/xxxx file; "
"if omitted, use the namespace\n"
"\treferred to by standard input "
"(file descriptor 0)\n");
exit(EXIT_FAILURE);
}
pid_str = argv[1];
pid = atoi(pid_str);
if (argc <= 2) {
ns_fd = STDIN_FILENO;
} else {
ns_fd = open(argv[2], O_RDONLY);
if (ns_fd == -1)
errExit("open-ns-file");
}
/* Get the relevant user namespace FD, which is 'ns_fd' if 'ns_fd' refers
to a user namespace, otherwise the user namespace that owns 'ns_fd' */
nstype = ioctl(ns_fd, NS_GET_NSTYPE);
if (nstype == -1)
errExit("ioctl-NS_GET_NSTYPE");
if (nstype == CLONE_NEWUSER) {
userns_fd = ns_fd;
} else {
userns_fd = ioctl(ns_fd, NS_GET_USERNS);
if (userns_fd == -1)
errExit("ioctl-NS_GET_USERNS");
}
/* Obtain 'stat' info for the user namespace of the specified PID */
snprintf(path, sizeof(path), "/proc/%s/ns/user", pid_str);
pid_userns_fd = open(path, O_RDONLY);
if (pid_userns_fd == -1)
errExit("open-PID");
if (fstat(pid_userns_fd, &pid_stat) == -1)
errExit("fstat-PID");
/* Get 'stat' info for the target user namesapce */
if (fstat(userns_fd, &target_stat) == -1)
errExit("fstat-PID");
/* If the PID is in the target user namespace, then it has
whatever capabilities are in its sets. */
if (pid_stat.st_dev == target_stat.st_dev &&
pid_stat.st_ino == target_stat.st_ino) {
printf("PID is in target namespace\n");
printf("Subject to LSM checks, it has the following capabilities\n");
show_cap(pid);
exit(EXIT_SUCCESS);
}
/* Otherwise, we need to walk through the ancestors of the target
user namespace to see if PID is in an ancestor namespace */
for (;;) {
int f;
next_fd = ioctl(userns_fd, NS_GET_PARENT);
if (next_fd == -1) {
/* The error here should be EPERM... */
if (errno != EPERM)
errExit("ioctl-NS_GET_PARENT");
printf("PID is not in an ancestor namespace\n");
printf("It has no capabilities in the target namespace\n");
exit(EXIT_SUCCESS);
}
if (fstat(next_fd, &target_stat) == -1)
errExit("fstat-PID");
/* If the 'stat' info for this user namespace matches the 'stat'
* info for 'next_fd', then the PID is in an ancestor namespace */
if (pid_stat.st_dev == target_stat.st_dev &&
pid_stat.st_ino == target_stat.st_ino)
break;
/* Next time round, get the next parent */
f = userns_fd;
userns_fd = next_fd;
close(f);
}
/* At this point, we found that PID is in an ancestor of the target
user namespace, and 'userns_fd' refers to the immediate descendant
user namespace of PID in the chain of user namespaces from PID to
the target user namespace. If the effective UID of PID matches the
owner UID of descendant user namespace, then PID has all
capabilities in the descendant namespace(s); otherwise, it just has
the capabilities that are in its sets. */
uid_t owner_uid, uid;
if (ioctl(userns_fd, NS_GET_OWNER_UID, &owner_uid) == -1) {
perror("ioctl-NS_GET_OWNER_UID");
exit(EXIT_FAILURE);
}
uid = get_euid_of_process(pid);
printf("PID is in an ancestor namespace\n");
if (owner_uid == uid) {
printf("And its effective UID matches the owner "
"of the namespace\n");
printf("Subject to LSM checks, PID has all capabilities in "
"that namespace!\n");
} else {
printf("But its effective UID does not match the owner "
"of the namespace\n");
printf("Subject to LSM checks, it has the following capabilities\n");
show_cap(pid);
}
exit(EXIT_SUCCESS);
}
8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---
Michael Kerrisk (2):
nsfs: Add an ioctl() to return the namespace type
nsfs: Add an ioctl() to return owner UID of a userns
fs/nsfs.c | 13 +++++++++++++
include/uapi/linux/nsfs.h | 9 +++++++--
2 files changed, 20 insertions(+), 2 deletions(-)
I'd like to write code that discovers the user namespace hierarchy on a
running system, and also shows who owns the various user namespaces.
Currently, there is no way of getting the owner UID of a user namespace.
Therefore, this patch adds a new NS_GET_CREATOR_UID ioctl() that fetches
the UID (as seen in the user namespace of the caller) of the creator of
the user namespace referred to by the specified file descriptor.
If the supplied file descriptor does not refer to a user namespace,
the operation fails with the error EINVAL. If the owner UID does
not have a mapping in the caller's user namespace return the
overflow UID as that appears easier to deal with in practice
in user-space applications.
-- EWB Changed the handling of unmapped UIDs from -EOVERFLOW
back to the overflow uid. Per conversation with
Michael Kerrisk after examining his test code.
Acked-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Michael Kerrisk <mtk-manpages@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
To support unprivileged users mounting filesystems two permission
checks have to be performed: a test to see if the user allowed to
create a mount in the mount namespace, and a test to see if
the user is allowed to access the specified filesystem.
The automount case is special in that mounting the original filesystem
grants permission to mount the sub-filesystems, to any user who
happens to stumble across the their mountpoint and satisfies the
ordinary filesystem permission checks.
Attempting to handle the automount case by using override_creds
almost works. It preserves the idea that permission to mount
the original filesystem is permission to mount the sub-filesystem.
Unfortunately using override_creds messes up the filesystems
ordinary permission checks.
Solve this by being explicit that a mount is a submount by introducing
vfs_submount, and using it where appropriate.
vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let
sget and friends know that a mount is a submount so they can take appropriate
action.
sget and sget_userns are modified to not perform any permission checks
on submounts.
follow_automount is modified to stop using override_creds as that
has proven problemantic.
do_mount is modified to always remove the new MS_SUBMOUNT flag so
that we know userspace will never by able to specify it.
autofs4 is modified to stop using current_real_cred that was put in
there to handle the previous version of submount permission checking.
cifs is modified to pass the mountpoint all of the way down to vfs_submount.
debugfs is modified to pass the mountpoint all of the way down to
trace_automount by adding a new parameter. To make this change easier
a new typedef debugfs_automount_t is introduced to capture the type of
the debugfs automount function.
Cc: stable@vger.kernel.org
Fixes: 069d5ac9ae ("autofs: Fix automounts by using current_real_cred()->uid")
Fixes: aeaa4a79ff ("fs: Call d_automount with the filesystems creds")
Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Chris Wilson wants the new fence tracepoint added in
commit 8c96c67801
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Jan 24 11:57:58 2017 +0000
dma/fence: Export enable-signaling tracepoint for emission by drivers
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>