932f4a630a
Pach series "Add FOLL_LONGTERM to GUP fast and use it". HFI1, qib, and mthca, use get_user_pages_fast() due to its performance advantages. These pages can be held for a significant time. But get_user_pages_fast() does not protect against mapping FS DAX pages. Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which retains the performance while also adding the FS DAX checks. XDP has also shown interest in using this functionality.[1] In addition we change get_user_pages() to use the new FOLL_LONGTERM flag and remove the specialized get_user_pages_longterm call. [1] https://lkml.org/lkml/2019/3/19/939 "longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. I've thought of a couple of alternative names but I think we have to settle on if we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name. Secondly, it depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... For the overall application performance. I don't have the numbers as the tests for HFI1 were done a long time ago. But there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem. Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they reworking the use of mmap_sem) and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point others are looking to use *_fast. As an aside, Jasons pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment. This patch (of 7): This patch starts a series which aims to support FOLL_LONGTERM in get_user_pages_fast(). Some callers who would like to do a longterm (user controlled pin) of pages with the fast variant of GUP for performance purposes. Rather than have a separate get_user_pages_longterm() call, introduce FOLL_LONGTERM and change the longterm callers to use it. This patch does not change any functionality. In the short term "longterm" or user controlled pins are unsafe for Filesystems and FS DAX in particular has been blocked. However, callers of get_user_pages_fast() were not "protected". FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it requires vmas to determine if DAX is in use. NOTE: In merging with the CMA changes we opt to change the get_user_pages() call in check_and_migrate_cma_pages() to a call of __get_user_pages_locked() on the newly migrated pages. This makes the code read better in that we are calling __get_user_pages_locked() on the pages before and after a potential migration. As a side affect some of the interfaces are cleaned up but this is not the primary purpose of the series. In review[1] it was asked: <quote> > This I don't get - if you do lock down long term mappings performance > of the actual get_user_pages call shouldn't matter to start with. > > What do I miss? A couple of points. First "longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. I've thought of a couple of alternative names but I think we have to settle on if we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name. Second, It depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... For the overall application performance. I don't have the numbers as the tests for HFI1 were done a long time ago. But there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem. Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they reworking the use of mmap_sem) and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point others are looking to use *_fast. As an asside, Jasons pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment. </quote> [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965 [ira.weiny@intel.com: v3] Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Rich Felker <dalias@libc.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: James Hogan <jhogan@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
413 lines
8.4 KiB
C
413 lines
8.4 KiB
C
// SPDX-License-Identifier: GPL-2.0
|
|
/* XDP user-space packet buffer
|
|
* Copyright(c) 2018 Intel Corporation.
|
|
*/
|
|
|
|
#include <linux/init.h>
|
|
#include <linux/sched/mm.h>
|
|
#include <linux/sched/signal.h>
|
|
#include <linux/sched/task.h>
|
|
#include <linux/uaccess.h>
|
|
#include <linux/slab.h>
|
|
#include <linux/bpf.h>
|
|
#include <linux/mm.h>
|
|
#include <linux/netdevice.h>
|
|
#include <linux/rtnetlink.h>
|
|
#include <linux/idr.h>
|
|
|
|
#include "xdp_umem.h"
|
|
#include "xsk_queue.h"
|
|
|
|
#define XDP_UMEM_MIN_CHUNK_SIZE 2048
|
|
|
|
static DEFINE_IDA(umem_ida);
|
|
|
|
void xdp_add_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs)
|
|
{
|
|
unsigned long flags;
|
|
|
|
spin_lock_irqsave(&umem->xsk_list_lock, flags);
|
|
list_add_rcu(&xs->list, &umem->xsk_list);
|
|
spin_unlock_irqrestore(&umem->xsk_list_lock, flags);
|
|
}
|
|
|
|
void xdp_del_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs)
|
|
{
|
|
unsigned long flags;
|
|
|
|
spin_lock_irqsave(&umem->xsk_list_lock, flags);
|
|
list_del_rcu(&xs->list);
|
|
spin_unlock_irqrestore(&umem->xsk_list_lock, flags);
|
|
}
|
|
|
|
/* The umem is stored both in the _rx struct and the _tx struct as we do
|
|
* not know if the device has more tx queues than rx, or the opposite.
|
|
* This might also change during run time.
|
|
*/
|
|
static int xdp_reg_umem_at_qid(struct net_device *dev, struct xdp_umem *umem,
|
|
u16 queue_id)
|
|
{
|
|
if (queue_id >= max_t(unsigned int,
|
|
dev->real_num_rx_queues,
|
|
dev->real_num_tx_queues))
|
|
return -EINVAL;
|
|
|
|
if (queue_id < dev->real_num_rx_queues)
|
|
dev->_rx[queue_id].umem = umem;
|
|
if (queue_id < dev->real_num_tx_queues)
|
|
dev->_tx[queue_id].umem = umem;
|
|
|
|
return 0;
|
|
}
|
|
|
|
struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev,
|
|
u16 queue_id)
|
|
{
|
|
if (queue_id < dev->real_num_rx_queues)
|
|
return dev->_rx[queue_id].umem;
|
|
if (queue_id < dev->real_num_tx_queues)
|
|
return dev->_tx[queue_id].umem;
|
|
|
|
return NULL;
|
|
}
|
|
EXPORT_SYMBOL(xdp_get_umem_from_qid);
|
|
|
|
static void xdp_clear_umem_at_qid(struct net_device *dev, u16 queue_id)
|
|
{
|
|
if (queue_id < dev->real_num_rx_queues)
|
|
dev->_rx[queue_id].umem = NULL;
|
|
if (queue_id < dev->real_num_tx_queues)
|
|
dev->_tx[queue_id].umem = NULL;
|
|
}
|
|
|
|
int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
|
|
u16 queue_id, u16 flags)
|
|
{
|
|
bool force_zc, force_copy;
|
|
struct netdev_bpf bpf;
|
|
int err = 0;
|
|
|
|
force_zc = flags & XDP_ZEROCOPY;
|
|
force_copy = flags & XDP_COPY;
|
|
|
|
if (force_zc && force_copy)
|
|
return -EINVAL;
|
|
|
|
rtnl_lock();
|
|
if (xdp_get_umem_from_qid(dev, queue_id)) {
|
|
err = -EBUSY;
|
|
goto out_rtnl_unlock;
|
|
}
|
|
|
|
err = xdp_reg_umem_at_qid(dev, umem, queue_id);
|
|
if (err)
|
|
goto out_rtnl_unlock;
|
|
|
|
umem->dev = dev;
|
|
umem->queue_id = queue_id;
|
|
if (force_copy)
|
|
/* For copy-mode, we are done. */
|
|
goto out_rtnl_unlock;
|
|
|
|
if (!dev->netdev_ops->ndo_bpf ||
|
|
!dev->netdev_ops->ndo_xsk_async_xmit) {
|
|
err = -EOPNOTSUPP;
|
|
goto err_unreg_umem;
|
|
}
|
|
|
|
bpf.command = XDP_SETUP_XSK_UMEM;
|
|
bpf.xsk.umem = umem;
|
|
bpf.xsk.queue_id = queue_id;
|
|
|
|
err = dev->netdev_ops->ndo_bpf(dev, &bpf);
|
|
if (err)
|
|
goto err_unreg_umem;
|
|
rtnl_unlock();
|
|
|
|
dev_hold(dev);
|
|
umem->zc = true;
|
|
return 0;
|
|
|
|
err_unreg_umem:
|
|
if (!force_zc)
|
|
err = 0; /* fallback to copy mode */
|
|
if (err)
|
|
xdp_clear_umem_at_qid(dev, queue_id);
|
|
out_rtnl_unlock:
|
|
rtnl_unlock();
|
|
return err;
|
|
}
|
|
|
|
static void xdp_umem_clear_dev(struct xdp_umem *umem)
|
|
{
|
|
struct netdev_bpf bpf;
|
|
int err;
|
|
|
|
if (umem->zc) {
|
|
bpf.command = XDP_SETUP_XSK_UMEM;
|
|
bpf.xsk.umem = NULL;
|
|
bpf.xsk.queue_id = umem->queue_id;
|
|
|
|
rtnl_lock();
|
|
err = umem->dev->netdev_ops->ndo_bpf(umem->dev, &bpf);
|
|
rtnl_unlock();
|
|
|
|
if (err)
|
|
WARN(1, "failed to disable umem!\n");
|
|
}
|
|
|
|
if (umem->dev) {
|
|
rtnl_lock();
|
|
xdp_clear_umem_at_qid(umem->dev, umem->queue_id);
|
|
rtnl_unlock();
|
|
}
|
|
|
|
if (umem->zc) {
|
|
dev_put(umem->dev);
|
|
umem->zc = false;
|
|
}
|
|
}
|
|
|
|
static void xdp_umem_unpin_pages(struct xdp_umem *umem)
|
|
{
|
|
unsigned int i;
|
|
|
|
for (i = 0; i < umem->npgs; i++) {
|
|
struct page *page = umem->pgs[i];
|
|
|
|
set_page_dirty_lock(page);
|
|
put_page(page);
|
|
}
|
|
|
|
kfree(umem->pgs);
|
|
umem->pgs = NULL;
|
|
}
|
|
|
|
static void xdp_umem_unaccount_pages(struct xdp_umem *umem)
|
|
{
|
|
if (umem->user) {
|
|
atomic_long_sub(umem->npgs, &umem->user->locked_vm);
|
|
free_uid(umem->user);
|
|
}
|
|
}
|
|
|
|
static void xdp_umem_release(struct xdp_umem *umem)
|
|
{
|
|
xdp_umem_clear_dev(umem);
|
|
|
|
ida_simple_remove(&umem_ida, umem->id);
|
|
|
|
if (umem->fq) {
|
|
xskq_destroy(umem->fq);
|
|
umem->fq = NULL;
|
|
}
|
|
|
|
if (umem->cq) {
|
|
xskq_destroy(umem->cq);
|
|
umem->cq = NULL;
|
|
}
|
|
|
|
xsk_reuseq_destroy(umem);
|
|
|
|
xdp_umem_unpin_pages(umem);
|
|
|
|
kfree(umem->pages);
|
|
umem->pages = NULL;
|
|
|
|
xdp_umem_unaccount_pages(umem);
|
|
kfree(umem);
|
|
}
|
|
|
|
static void xdp_umem_release_deferred(struct work_struct *work)
|
|
{
|
|
struct xdp_umem *umem = container_of(work, struct xdp_umem, work);
|
|
|
|
xdp_umem_release(umem);
|
|
}
|
|
|
|
void xdp_get_umem(struct xdp_umem *umem)
|
|
{
|
|
refcount_inc(&umem->users);
|
|
}
|
|
|
|
void xdp_put_umem(struct xdp_umem *umem)
|
|
{
|
|
if (!umem)
|
|
return;
|
|
|
|
if (refcount_dec_and_test(&umem->users)) {
|
|
INIT_WORK(&umem->work, xdp_umem_release_deferred);
|
|
schedule_work(&umem->work);
|
|
}
|
|
}
|
|
|
|
static int xdp_umem_pin_pages(struct xdp_umem *umem)
|
|
{
|
|
unsigned int gup_flags = FOLL_WRITE;
|
|
long npgs;
|
|
int err;
|
|
|
|
umem->pgs = kcalloc(umem->npgs, sizeof(*umem->pgs),
|
|
GFP_KERNEL | __GFP_NOWARN);
|
|
if (!umem->pgs)
|
|
return -ENOMEM;
|
|
|
|
down_read(¤t->mm->mmap_sem);
|
|
npgs = get_user_pages(umem->address, umem->npgs,
|
|
gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL);
|
|
up_read(¤t->mm->mmap_sem);
|
|
|
|
if (npgs != umem->npgs) {
|
|
if (npgs >= 0) {
|
|
umem->npgs = npgs;
|
|
err = -ENOMEM;
|
|
goto out_pin;
|
|
}
|
|
err = npgs;
|
|
goto out_pgs;
|
|
}
|
|
return 0;
|
|
|
|
out_pin:
|
|
xdp_umem_unpin_pages(umem);
|
|
out_pgs:
|
|
kfree(umem->pgs);
|
|
umem->pgs = NULL;
|
|
return err;
|
|
}
|
|
|
|
static int xdp_umem_account_pages(struct xdp_umem *umem)
|
|
{
|
|
unsigned long lock_limit, new_npgs, old_npgs;
|
|
|
|
if (capable(CAP_IPC_LOCK))
|
|
return 0;
|
|
|
|
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
|
|
umem->user = get_uid(current_user());
|
|
|
|
do {
|
|
old_npgs = atomic_long_read(&umem->user->locked_vm);
|
|
new_npgs = old_npgs + umem->npgs;
|
|
if (new_npgs > lock_limit) {
|
|
free_uid(umem->user);
|
|
umem->user = NULL;
|
|
return -ENOBUFS;
|
|
}
|
|
} while (atomic_long_cmpxchg(&umem->user->locked_vm, old_npgs,
|
|
new_npgs) != old_npgs);
|
|
return 0;
|
|
}
|
|
|
|
static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
|
|
{
|
|
u32 chunk_size = mr->chunk_size, headroom = mr->headroom;
|
|
unsigned int chunks, chunks_per_page;
|
|
u64 addr = mr->addr, size = mr->len;
|
|
int size_chk, err, i;
|
|
|
|
if (chunk_size < XDP_UMEM_MIN_CHUNK_SIZE || chunk_size > PAGE_SIZE) {
|
|
/* Strictly speaking we could support this, if:
|
|
* - huge pages, or*
|
|
* - using an IOMMU, or
|
|
* - making sure the memory area is consecutive
|
|
* but for now, we simply say "computer says no".
|
|
*/
|
|
return -EINVAL;
|
|
}
|
|
|
|
if (!is_power_of_2(chunk_size))
|
|
return -EINVAL;
|
|
|
|
if (!PAGE_ALIGNED(addr)) {
|
|
/* Memory area has to be page size aligned. For
|
|
* simplicity, this might change.
|
|
*/
|
|
return -EINVAL;
|
|
}
|
|
|
|
if ((addr + size) < addr)
|
|
return -EINVAL;
|
|
|
|
chunks = (unsigned int)div_u64(size, chunk_size);
|
|
if (chunks == 0)
|
|
return -EINVAL;
|
|
|
|
chunks_per_page = PAGE_SIZE / chunk_size;
|
|
if (chunks < chunks_per_page || chunks % chunks_per_page)
|
|
return -EINVAL;
|
|
|
|
headroom = ALIGN(headroom, 64);
|
|
|
|
size_chk = chunk_size - headroom - XDP_PACKET_HEADROOM;
|
|
if (size_chk < 0)
|
|
return -EINVAL;
|
|
|
|
umem->address = (unsigned long)addr;
|
|
umem->chunk_mask = ~((u64)chunk_size - 1);
|
|
umem->size = size;
|
|
umem->headroom = headroom;
|
|
umem->chunk_size_nohr = chunk_size - headroom;
|
|
umem->npgs = size / PAGE_SIZE;
|
|
umem->pgs = NULL;
|
|
umem->user = NULL;
|
|
INIT_LIST_HEAD(&umem->xsk_list);
|
|
spin_lock_init(&umem->xsk_list_lock);
|
|
|
|
refcount_set(&umem->users, 1);
|
|
|
|
err = xdp_umem_account_pages(umem);
|
|
if (err)
|
|
return err;
|
|
|
|
err = xdp_umem_pin_pages(umem);
|
|
if (err)
|
|
goto out_account;
|
|
|
|
umem->pages = kcalloc(umem->npgs, sizeof(*umem->pages), GFP_KERNEL);
|
|
if (!umem->pages) {
|
|
err = -ENOMEM;
|
|
goto out_account;
|
|
}
|
|
|
|
for (i = 0; i < umem->npgs; i++)
|
|
umem->pages[i].addr = page_address(umem->pgs[i]);
|
|
|
|
return 0;
|
|
|
|
out_account:
|
|
xdp_umem_unaccount_pages(umem);
|
|
return err;
|
|
}
|
|
|
|
struct xdp_umem *xdp_umem_create(struct xdp_umem_reg *mr)
|
|
{
|
|
struct xdp_umem *umem;
|
|
int err;
|
|
|
|
umem = kzalloc(sizeof(*umem), GFP_KERNEL);
|
|
if (!umem)
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
err = ida_simple_get(&umem_ida, 0, 0, GFP_KERNEL);
|
|
if (err < 0) {
|
|
kfree(umem);
|
|
return ERR_PTR(err);
|
|
}
|
|
umem->id = err;
|
|
|
|
err = xdp_umem_reg(umem, mr);
|
|
if (err) {
|
|
ida_simple_remove(&umem_ida, umem->id);
|
|
kfree(umem);
|
|
return ERR_PTR(err);
|
|
}
|
|
|
|
return umem;
|
|
}
|
|
|
|
bool xdp_umem_validate_queues(struct xdp_umem *umem)
|
|
{
|
|
return umem->fq && umem->cq;
|
|
}
|