Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2022-05-13
1) Cleanups for the code behind the XFRM offload API. This is a
preparation for the extension of the API for policy offload.
From Leon Romanovsky.
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: drop not needed flags variable in XFRM offload struct
net/mlx5e: Use XFRM state direction instead of flags
netdevsim: rely on XFRM state direction instead of flags
ixgbe: propagate XFRM offload state direction instead of flags
xfrm: store and rely on direction to construct offload flags
xfrm: rename xfrm_state_offload struct to allow reuse
xfrm: delete not used number of external headers
xfrm: free not used XFRM_ESP_NO_TRAILER flag
====================
Link: https://lore.kernel.org/r/20220513151218.4010119-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull misc fixes from Andrew Morton:
"Seven MM fixes, three of which address issues added in the most recent
merge window, four of which are cc:stable.
Three non-MM fixes, none very serious"
[ And yes, that's a real pull request from Andrew, not me creating a
branch from emailed patches. Woo-hoo! ]
* tag 'mm-hotfixes-stable-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: add a mailing list for DAMON development
selftests: vm: Makefile: rename TARGETS to VMTARGETS
mm/kfence: reset PG_slab and memcg_data before freeing __kfence_pool
mailmap: add entry for martyna.szapar-mudlaw@intel.com
arm[64]/memremap: don't abuse pfn_valid() to ensure presence of linear map
procfs: prevent unprivileged processes accessing fdinfo dir
mm: mremap: fix sign for EFAULT error return value
mm/hwpoison: use pr_err() instead of dump_page() in get_any_page()
mm/huge_memory: do not overkill when splitting huge_zero_page
Revert "mm/memory-failure.c: skip huge_zero_page in memory_failure()"
If CONFIG_PTP_1588_CLOCK=m and CONFIG_SFC_SIENA=y, the siena driver will fail to link:
drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_ptp_remove_channel':
ptp.c:(.text+0xa28): undefined reference to `ptp_clock_unregister'
drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_ptp_probe_channel':
ptp.c:(.text+0x13a0): undefined reference to `ptp_clock_register'
ptp.c:(.text+0x1470): undefined reference to `ptp_clock_unregister'
drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_ptp_pps_worker':
ptp.c:(.text+0x1d29): undefined reference to `ptp_clock_event'
drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_siena_ptp_get_ts_info':
ptp.c:(.text+0x301b): undefined reference to `ptp_clock_index'
To fix this build error, make SFC_SIENA depends on PTP_1588_CLOCK.
Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: d48523cb88 ("sfc: Copy shared files needed for Siena (part 2)")
Signed-off-by: Ren Zhijie <renzhijie2@huawei.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Link: https://lore.kernel.org/r/20220513012721.140871-1-renzhijie2@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull hwmon fixes from Guenter Roeck:
- Restrict ltq-cputemp to SOC_XWAY to fix build failure
- Add OF device ID table to tmp401 driver to enable auto-load
* tag 'hwmon-for-v5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (ltq-cputemp) restrict it to SOC_XWAY
hwmon: (tmp401) Add OF device ID table
Pull drm fixes from Dave Airlie:
"Pretty quiet week on the fixes front, 4 amdgpu and one i915 fix.
I think there might be a few misc fbdev ones outstanding, but I'll see
if they are necessary and pass them on if so.
amdgpu:
- Disable ASPM for VI boards on ADL platforms
- S0ix DCN3.1 display fix
- Resume regression fix
- Stable pstate fix
i915:
- fix for kernel memory corruption when running a lot of OpenCL tests
in parallel"
* tag 'drm-fixes-2022-05-13' of git://anongit.freedesktop.org/drm/drm:
drm/amdgpu/ctx: only reset stable pstate if the user changed it (v2)
Revert "drm/amd/pm: keep the BACO feature enabled for suspend"
drm/i915: Fix race in __i915_vma_remove_closed
drm/amd/display: undo clearing of z10 related function pointers
drm/amdgpu: vi: disable ASPM on Intel Alder Lake based systems
[ Similarly to commit a765ed47e4 ("PCI: hv: Fix synchronization
between channel callback and hv_compose_msi_msg()"): ]
The (on-stack) teardown packet becomes invalid once the completion
timeout in hv_pci_bus_exit() has expired and hv_pci_bus_exit() has
returned. Prevent the channel callback from accessing the invalid
packet by removing the ID associated to such packet from the VMbus
requestor in hv_pci_bus_exit().
Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Link: https://lore.kernel.org/r/20220511223207.3386-3-parri.andrea@gmail.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
For additional robustness in the face of Hyper-V errors or malicious
behavior, validate all values that originate from packets that Hyper-V
has sent to the guest in the host-to-guest ring buffer. Ensure that
invalid values cannot cause data being copied out of the bounds of the
source buffer in hv_pci_onchannelcallback().
While at it, remove a redundant validation in hv_pci_generic_compl():
hv_pci_onchannelcallback() already ensures that all processed incoming
packets are "at least as large as [in fact larger than] a response".
Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Link: https://lore.kernel.org/r/20220511223207.3386-2-parri.andrea@gmail.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
The checksum is optional for UDP packets. However nf_reject would
previously require a valid checksum to elicit a response such as
ICMP_DEST_UNREACH.
Add some logic to nf_reject_verify_csum to determine if a UDP packet has
a zero checksum and should therefore not be verified.
Signed-off-by: Kevin Mitchell <kevmitch@arista.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When creating a flow table entry, the reverse route is looked
up based on the current packet.
There can be scenarios where the user creates a custom ip rule
to route the traffic differently.
In order to support those scenarios, the lookup needs to add
more information based on the current packet.
The patch adds multiple new information to the route lookup.
Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The pointer check usually results in a 'false positive': its likely
that the ctnetlink module is loaded but no event monitoring is enabled.
After recent change to autodetect ctnetlink usage and only allocate
the ecache extension if a listener is active, check if the extension
is present on a given conntrack.
If its not there, there is nothing to report and calls to the
notification framework can be elided.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This adds the new nf_conntrack_events=2 mode and makes it the
default.
This leverages the earlier flag in struct net to allow to avoid
the event extension as long as no event listener is active in
the namespace.
This avoids, for most cases, allocation of ct->ext area.
A followup patch will take further advantage of this by avoiding
calls down into the event framework if the extension isn't present.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Only called when new ct is allocated or the extension isn't present.
This function will be extended, place this in the conntrack module
instead of inlining.
The callers already depend on nf_conntrack module.
Return value is changed to bool, noone used the returned pointer.
Make sure that the core drops the newly allocated conntrack
if the extension is requested but can't be added.
This makes it necessary to ifdef the section, as the stub
always returns false we'd drop every new conntrack if the
the ecache extension is disabled in kconfig.
Add from data path (xt_CT, nft_ct) is unchanged.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
At this time, every new conntrack gets the 'event cache extension'
enabled for it.
This is because the 'net.netfilter.nf_conntrack_events' sysctl defaults
to 1.
Changing the default to 0 means that commands that rely on the event
notification extension, e.g. 'conntrack -E' or conntrackd, stop working.
We COULD detect if there is a listener by means of
'nfnetlink_has_listeners()' and only add the extension if this is true.
The downside is a dependency from conntrack module to nfnetlink module.
This adds a different way: inc/dec a counter whenever a ctnetlink group
is being (un)subscribed and toggle a flag in struct net.
Next patches will take advantage of this and will only add the event
extension if the flag is set.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds a structure to collect all the context data that is
passed to the cleanup iterator.
struct nf_ct_iter_data {
struct net *net;
void *data;
u32 portid;
int report;
};
There is a netns field that allows to clean up conntrack entries
specifically owned by the specified netns.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Now that the conntrack entry isn't placed on the pcpu list anymore the
bh only needs to be disabled in the 'expectation present' case.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Its not needed anymore:
A. If entry is totally new, then the rcu-protected resource
must already have been removed from global visibility before call
to nf_ct_iterate_destroy.
B. If entry was allocated before, but is not yet in the hash table
(uncofirmed case), genid gets incremented and synchronize_rcu() call
makes sure access has completed.
C. Next attempt to peek at extension area will fail for unconfirmed
conntracks, because ext->genid != genid.
D. Conntracks in the hash are iterated as before.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Increment the extid on module removal; this makes sure that even
in extreme cases any old uncofirmed entry that happened to be kept
e.g. on nfnetlink_queue list will not trip over a stale timeout
reference.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Multiple netfilter extensions store pointers to external data
in their extension area struct.
Examples:
1. Timeout policies
2. Connection tracking helpers.
No references are taken for these.
When a helper or timeout policy is removed, the conntrack table gets
traversed and affected extensions are cleared.
Conntrack entries not yet in the hashtable are referenced via a special
list, the unconfirmed list.
On removal of a policy or connection tracking helper, the unconfirmed
list gets traversed an all entries are marked as dying, this prevents
them from getting committed to the table at insertion time: core checks
for dying bit, if set, the conntrack entry gets destroyed at confirm
time.
The disadvantage is that each new conntrack has to be added to the percpu
unconfirmed list, and each insertion needs to remove it from this list.
The list is only ever needed when a policy or helper is removed -- a rare
occurrence.
Add a generation ID count: Instead of adding to the list and then
traversing that list on policy/helper removal, increment a counter
that is stored in the extension area.
For unconfirmed conntracks, the extension has the genid valid at ct
allocation time.
Removal of a helper/policy etc. increments the counter.
At confirmation time, validate that ext->genid == global_id.
If the stored number is not the same, do not allow the conntrack
insertion, just like as if a confirmed-list traversal would have flagged
the entry as dying.
After insertion, the genid is no longer relevant (conntrack entries
are now reachable via the conntrack table iterators and is set to 0.
This allows removal of the percpu unconfirmed list.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This helper tags connections not yet in the conntrack table as
dying. These nf_conn entries will be dropped instead when the
core attempts to insert them from the input or postrouting
'confirm' hook.
After the previous change, the entries get unlinked from the
list earlier, so that by the time the actual exit hook runs,
new connections no longer have a timeout policy assigned.
Its enough to walk the hashtable instead.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Make it so netns pre_exit unlinks the objects from the pernet list, so
they cannot be found anymore.
netns core issues a synchronize_rcu() before calling the exit hooks so
any the time the exit hooks run unconfirmed nf_conn entries have been
free'd or they have been committed to the hashtable.
The exit hook still tags unconfirmed entries as dying, this can
now be removed in a followup change.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Its no longer needed. Entries that need event redelivery are placed
on the new pernet dying list.
The advantage is that there is no need to take additional spinlock on
conntrack removal unless event redelivery failed or the conntrack entry
was never added to the table in the first place (confirmed bit not set).
The IPS_CONFIRMED bit now needs to be set as soon as the entry has been
unlinked from the unconfirmed list, else the destroy function may
attempt to unlink it a second time.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The new pernet dying list includes conntrack entries that await
delivery of the 'destroy' event via ctnetlink.
The old percpu dying list will be removed soon.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This disentangles event redelivery and the percpu dying list.
Because entries are now stored on a dedicated list, all
entries are in NFCT_ECACHE_DESTROY_FAIL state and all entries
still have confirmed bit set -- the reference count is at least 1.
The 'struct net' back-pointer can be removed as well.
The pcpu dying list will be removed eventually, it has no functionality.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
VFIO PCI does a security check as part of hot reset to prove that the user
has permission to manipulate all the devices that will be impacted by the
reset.
Use a new API vfio_file_has_dev() to perform this security check against
the struct file directly and remove the vfio_group from VFIO PCI.
Since VFIO PCI was the last user of vfio_group_get_external_user() and
vfio_group_put_external_user() remove it as well.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/8-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
None of the VFIO APIs take in the vfio_group anymore, so we can remove it
completely.
This has a subtle side effect on the enforced coherency tracking. The
vfio_group_get_external_user() was holding on to the container_users which
would prevent the iommu_domain and thus the enforced coherency value from
changing while the group is registered with kvm.
It changes the security proof slightly into 'user must hold a group FD
that has a device that cannot enforce DMA coherence'. As opening the group
FD, not attaching the container, is the privileged operation this doesn't
change the security properties much.
On the flip side it paves the way to changing the iommu_domain/container
attached to a group at runtime which is something that will be required to
support nested translation.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>i
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_group_fops_open() ensures there is only ever one struct file open for
any struct vfio_group at any time:
/* Do we need multiple instances of the group open? Seems not. */
opened = atomic_cmpxchg(&group->opened, 0, 1);
if (opened) {
vfio_group_put(group);
return -EBUSY;
Therefor the struct file * can be used directly to search the list of VFIO
groups that KVM keeps instead of using the
vfio_external_group_match_file() callback to try to figure out if the
passed in FD matches the list or not.
Delete vfio_external_group_match_file().
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The only caller wants to get a pointer to the struct iommu_group
associated with the VFIO group file. Instead of returning the group ID
then searching sysfs for that string to get the struct iommu_group just
directly return the iommu_group pointer already held by the vfio_group
struct.
It already has a safe lifetime due to the struct file kref, the vfio_group
and thus the iommu_group cannot be destroyed while the group file is open.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add c-level handler for the division by zero exception and kill the task
if it was thrown from the kernel space or send SIGFPE otherwise.
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
mvebu arm for 5.19 (part 1)
Fix typos in comment on orion5x files
* tag 'mvebu-arm-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/gclement/mvebu:
orion5x: fix typos in comments
Link: https://lore.kernel.org/r/87o801r2ss.fsf@BL-laptop
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
With very limited vram on svga3 it's difficult to handle all the surface
migrations. Without gbobjects, i.e. the ability to store surfaces in
guest mobs, there's no reason to support intermediate svga2 features,
especially because we can fall back to fb traces and svga3 will never
support those in-between features.
On svga3 we wither want to use fb traces or screen targets
(i.e. gbobjects), nothing in between. This fixes presentation on a lot
of fusion/esxi tech previews where the exposed svga3 caps haven't been
finalized yet.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Fixes: 2cd80dbd35 ("drm/vmwgfx: Add basic support for SVGA3")
Cc: <stable@vger.kernel.org> # v5.14+
Reviewed-by: Martin Krastev <krastevm@vmware.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220318174332.440068-5-zack@kde.org
Port of the vmwgfx to SVGAv3 lacked support for fencing. SVGAv3 removed
FIFO's and replaced them with command buffers and extra registers.
The initial version of SVGAv3 lacked support for most advanced features
(e.g. 3D) which made fences unnecessary. That is no longer the case,
especially as 3D support is being turned on.
Switch from FIFO commands and capabilities to command buffers and extra
registers to enable fences on SVGAv3.
Fixes: 2cd80dbd35 ("drm/vmwgfx: Add basic support for SVGA3")
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Martin Krastev <krastevm@vmware.com>
Reviewed-by: Maaz Mombasawala <mombasawalam@vmware.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220302152426.885214-5-zack@kde.org
Don't decrease the number of poisoned pages in page_alloc.c, let the
memory-failure.c do inc/dec poisoned pages only.
Also simplify unpoison_memory(), only decrease the number of
poisoned pages when:
- TestClearPageHWPoison() succeed
- put_page_back_buddy succeed
After decreasing, print necessary log.
Finally, remove clear_page_hwpoison() and unpoison_taken_off_page().
Link: https://lkml.kernel.org/r/20220509105641.491313-3-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "memory-failure: fix hwpoison_filter", v2.
As well known, the memory failure mechanism handles memory corrupted
event, and try to send SIGBUS to the user process which uses this
corrupted page.
For the virtualization case, QEMU catches SIGBUS and tries to inject MCE
into the guest, and the guest handles memory failure again. Thus the
guest gets the minimal effect from hardware memory corruption.
The further step I'm working on:
1, try to modify code to decrease poisoned pages in a single place
(mm/memofy-failure.c: simplify num_poisoned_pages_dec in this series).
2, try to use page_handle_poison() to handle SetPageHWPoison() and
num_poisoned_pages_inc() together. It would be best to call
num_poisoned_pages_inc() in a single place too.
3, introduce memory failure notifier list in memory-failure.c: notify
the corrupted PFN to someone who registers this list. If I can
complete [1] and [2] part, [3] will be quite easy(just call notifier
list after increasing poisoned page).
4, introduce memory recover VQ for memory balloon device, and registers
memory failure notifier list. During the guest kernel handles memory
failure, balloon device gets notified by memory failure notifier list,
and tells the host to recover the corrupted PFN(GPA) by the new VQ.
5, host side remaps the corrupted page(HVA), and tells the guest side
to unpoison the PFN(GPA). Then the guest fixes the corrupted page(GPA)
dynamically.
This patch (of 5):
clear_hwpoisoned_pages() clears HWPoison flag and decreases the number of
poisoned pages, this actually works as part of memory failure.
Move this function from sparse.c to memory-failure.c, finally there is no
CONFIG_MEMORY_FAILURE in sparse.c.
Link: https://lkml.kernel.org/r/20220509105641.491313-1-pizhenwei@bytedance.com
Link: https://lkml.kernel.org/r/20220509105641.491313-2-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>