Merge branch 'netlink-mmap'

Patrick McHardy says:

====================
The following patches contain an implementation of memory mapped I/O for
netlink. The implementation is modelled after AF_PACKET memory mapped I/O
with a few differences:

- In order to perform memory mapped I/O to userspace, the kernel allocates
  skbs with the data area pointing to the data area of the mapped frames.
  All netlink subsystems assume a linear data area, so for the sake of
  simplicity, the mapped data area is not attached to the paged area but
  to skb->data. This requires introduction of a special skb alloction
  function that just allocates an skb head without the data area. Since this
  is a quite rare use case, I introduced a new function based on __alloc_skb
  instead of splitting it up into head and data alloction. The alternative
  would be to   introduce an __alloc_skb_head and __alloc_skb_data function,
  which would actually be useful for a specific error case in memory mapped
  netlink, but would require a couple of extra instructions for the common
  skb allocation case, so it doesn't really seem worth it.

  In order to get the destination memory area for skb->data before message
  construction, memory mapped netlink I/O needs to look up the destination
  socket during allocation instead of during transmission because the
  ring is owned by the receiveing socket/process. A special skb allocation
  function (netlink_alloc_skb) taking the destination pid as an argument is
  used for this, all subsystems that want to support memory mapped I/O need
  to use this function, automatic fallback to the receive queue happens
  for unconverted subsystems. Dumps automatically use memory mapped I/O if
  the receiving socket has enabled it.

  The visible effect of looking up the destination socket during allocation
  instead of transmission is that message ordering in userspace might
  change in case allocation and transmission aren't performed atomically.
  This usually doesn't matter since most subsystems have a BKL-like lock
  like the rtnl mutex, to my knowledge the currently only existing case
  where it might matter is nfnetlink_queue combined with the recently
  introduced batched verdicts, but a) that subsystem already includes
  sequence numbers which allow userspace to reorder messages in case it
  cares to, also the reodering window is quite small and b) with memory
  mapped transmission batching can be performed in a subsystem indepandant
  manner.

- AF_NETLINK contains flow control for database dumps, with regular I/O
  dump continuation are triggered based on the sockets receive queue space
  and by recvmsg() calls. Since with memory mapped I/O there are no
  recvmsg() calls under normal operation, this is done in netlink_poll(),
  under the assumption that userspace has processed all pending frames
  before invoking poll(), thus the ring is expected to have room for new
  messages. Dumps currently don't benefit as much as they could from
  memory mapped I/O because each single continuation requires a poll()
  call. A more agressive approach seems like a good idea to me, especially
  in case the socket is not subscribed to any multicast groups (IOW only
  receiving explicitly requested data).

Besides that, the memory mapped netlink implementation extends the states
defined by AF_PACKET between userspace and the kernel by a SKIP status, this
is intended for the case that userspace wants to queue frames (specifically
when using nfnetlink_queue, an IDS and stream reassembly, requested by
Eric Leblond) for a longer period of time. The kernel skips over all frames
marked with SKIP when looking or unused frames and only fails when not finding
a free frame or when having skipped the entire ring.

Also noteworthy is memory mapped sendmsg: the kernel performs validation
of messages before accepting and processing them, in order to prevent
userspace from changing the messages contents after validation, the
kernel checks that the ring is only mapped once and the file descriptor
is not shared (in order to avoid having userspace set up another mapping
after the first mentioned check). If either of both is not true, the
message copied to an allocated skb and processed as with regular I/O.
I'd especially appreciate review of this part since I'm not really versed
in memory, file and process management,

The remaining interesting details are included in the changelogs of the
individual patches and the documentation, so I won't repeat them here.

As an example, nfnetlink_queue is convererted to support memory mapped
I/O. Other subsystems that would probably benefit are nfnetlink_log,
audit and maybe ISCSI, not sure.

Following are some numbers collected by Florian Westphal based on a
slightly older version, which included an experimental patch for the
nfnetlink_queue ordering issue.

===

Test hardware is a 12-core machine
Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
ixgbe interfaces are used (i.e., multiqueue nics).
irqs are distributed across the cpus.

I've made several tests.

The simple one consists of 3GBit UDP traffic, packets are 1500 bytes
in size (i.e., no fragmentation), with a single nfqueue
and the test client programs in libmnl examples directory.
Packets are sent from one /24 net to another /24 net, i.e.
there are a few hundred flows active at any given time.

I've also tested with snort, but I disabled all rules.
6Gbit UDP traffic is generated in the snort case, and
6 nfqueues are used (i.e., 6 snorts run in parallel).

I've tested with 3 different kernels, all based on 3.7.1.
- 3.7.1, without the mmap patches
- 3.7.1, with Patricks mmap patches
- 3.7.1, with mmap patches and extended spinlock to ensure packet ids are
  monotonically increasing and cannot be re-ordered.  This is what we
  currently ship in our product.

  [ the spinlock that is extended is the per nfqueue spinlock, it will
    be held from the time the netlink skb is allocated until the netlink
    skb is sent to userspace:

    http://1984.lsi.us.es/git/nf-next/commit/?h=mmap-netlink3&id=b8eb19c46650fef4e9e4fe53f367f99bbf72afc9
  ]

snort is normally used in "batch mode", i.e., after processing 25 packets
a single "batch verdict" is sent to accept the packets seen so far.
"mmap snort" means RX_RING + sendmsg(), i.e. TX_RING is not used at this
time (except where noted below).

One reason is that snort has a reload thread, so kernel needs to copy;
also in the snort case no payload rewrite takes place, so compared
to the rx path the tx path is cheap.

Results:

3.7.1, without mmap patches, i.e. recv()+sendmsg() for everyone
nfq-queue:           1.7 gbit out
snort-recv-batch-25  5.1 gbit out
snort-recv-no-batch  3.1 gbit out

3.7.1 + mmap + without extended spinlocked section
nfq-queue:           1.7 gbit out (recv/sendmsg)
nfq-queue-mmap:      2.4 gbit out
snort-mmap-batch-25	 5.6 gbit out  (warning: since ids can be
                                        re-ordered, this version is "broken").
snort-recv-batch-25	 5.1 gbit out
snort-mmap-no-batch	 4.6 gbit out (i.e., one verdict per packet)

Kernel 3.7.1 + mmap + extended spinlock section:
nfq-queue:	1.4 gbit out
nfq-queue-mmap: 2.3 gbit out
snort:          5.6 gbit out

Conclusions:
- The "extended spinlocked section" hurts performance in the
  single queue case; with 6 snorts there is no measureable slowdown.
- I tried to re-write the mmap-snort to work without batch verdicts, but
  results were not very encouraging:

kernel 3.7.1 + mmap (without extended spinlocked section):

snort-mmap-batch-25      5.6 gbit out (what we currenlty ship)
snort-recv-batch-25      5.1 gbit out (without using mmap)
snort-mmap-batch-1       4.6 gbit out (with mmap but without batch verdicts)
snort-mmap-txring-25     5.2 gbit out (with mmap but without batch verdicts)
snort-mmap-txring-1      4.6 gbit out (with mmap but without batch verdicts)

The difference between the last two is that in the txring-25 case, we
put a verdict into the tx ring after every packet, but will only
invoke sendmsg(, NULL, 0) after processing 25 packets.  So the only
difference is the number of sendmsg calls/context switches.

So, i.o.w, kernel 3.7.1 + mmap + the extra locking crap is faster
than 3.7.1 + mmap-without-extra-locking and single-verdict-per packet.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
This commit is contained in:
David S. Miller 2013-04-19 15:37:09 -04:00
commit 42bbcb7803
21 changed files with 1331 additions and 74 deletions

View File

@ -0,0 +1,339 @@
This file documents how to use memory mapped I/O with netlink.
Author: Patrick McHardy <kaber@trash.net>
Overview
--------
Memory mapped netlink I/O can be used to increase throughput and decrease
overhead of unicast receive and transmit operations. Some netlink subsystems
require high throughput, these are mainly the netfilter subsystems
nfnetlink_queue and nfnetlink_log, but it can also help speed up large
dump operations of f.i. the routing database.
Memory mapped netlink I/O used two circular ring buffers for RX and TX which
are mapped into the processes address space.
The RX ring is used by the kernel to directly construct netlink messages into
user-space memory without copying them as done with regular socket I/O,
additionally as long as the ring contains messages no recvmsg() or poll()
syscalls have to be issued by user-space to get more message.
The TX ring is used to process messages directly from user-space memory, the
kernel processes all messages contained in the ring using a single sendmsg()
call.
Usage overview
--------------
In order to use memory mapped netlink I/O, user-space needs three main changes:
- ring setup
- conversion of the RX path to get messages from the ring instead of recvmsg()
- conversion of the TX path to construct messages into the ring
Ring setup is done using setsockopt() to provide the ring parameters to the
kernel, then a call to mmap() to map the ring into the processes address space:
- setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &params, sizeof(params));
- setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &params, sizeof(params));
- ring = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)
Usage of either ring is optional, but even if only the RX ring is used the
mapping still needs to be writable in order to update the frame status after
processing.
Conversion of the reception path involves calling poll() on the file
descriptor, once the socket is readable the frames from the ring are
processsed in order until no more messages are available, as indicated by
a status word in the frame header.
On kernel side, in order to make use of memory mapped I/O on receive, the
originating netlink subsystem needs to support memory mapped I/O, otherwise
it will use an allocated socket buffer as usual and the contents will be
copied to the ring on transmission, nullifying most of the performance gains.
Dumps of kernel databases automatically support memory mapped I/O.
Conversion of the transmit path involves changing message contruction to
use memory from the TX ring instead of (usually) a buffer declared on the
stack and setting up the frame header approriately. Optionally poll() can
be used to wait for free frames in the TX ring.
Structured and definitions for using memory mapped I/O are contained in
<linux/netlink.h>.
RX and TX rings
----------------
Each ring contains a number of continous memory blocks, containing frames of
fixed size dependant on the parameters used for ring setup.
Ring: [ block 0 ]
[ frame 0 ]
[ frame 1 ]
[ block 1 ]
[ frame 2 ]
[ frame 3 ]
...
[ block n ]
[ frame 2 * n ]
[ frame 2 * n + 1 ]
The blocks are only visible to the kernel, from the point of view of user-space
the ring just contains the frames in a continous memory zone.
The ring parameters used for setting up the ring are defined as follows:
struct nl_mmap_req {
unsigned int nm_block_size;
unsigned int nm_block_nr;
unsigned int nm_frame_size;
unsigned int nm_frame_nr;
};
Frames are grouped into blocks, where each block is a continous region of memory
and holds nm_block_size / nm_frame_size frames. The total number of frames in
the ring is nm_frame_nr. The following invariants hold:
- frames_per_block = nm_block_size / nm_frame_size
- nm_frame_nr = frames_per_block * nm_block_nr
Some parameters are constrained, specifically:
- nm_block_size must be a multiple of the architectures memory page size.
The getpagesize() function can be used to get the page size.
- nm_frame_size must be equal or larger to NL_MMAP_HDRLEN, IOW a frame must be
able to hold at least the frame header
- nm_frame_size must be smaller or equal to nm_block_size
- nm_frame_size must be a multiple of NL_MMAP_MSG_ALIGNMENT
- nm_frame_nr must equal the actual number of frames as specified above.
When the kernel can't allocate phsyically continous memory for a ring block,
it will fall back to use physically discontinous memory. This might affect
performance negatively, in order to avoid this the nm_frame_size parameter
should be chosen to be as small as possible for the required frame size and
the number of blocks should be increased instead.
Ring frames
------------
Each frames contain a frame header, consisting of a synchronization word and some
meta-data, and the message itself.
Frame: [ header message ]
The frame header is defined as follows:
struct nl_mmap_hdr {
unsigned int nm_status;
unsigned int nm_len;
__u32 nm_group;
/* credentials */
__u32 nm_pid;
__u32 nm_uid;
__u32 nm_gid;
};
- nm_status is used for synchronizing processing between the kernel and user-
space and specifies ownership of the frame as well as the operation to perform
- nm_len contains the length of the message contained in the data area
- nm_group specified the destination multicast group of message
- nm_pid, nm_uid and nm_gid contain the netlink pid, UID and GID of the sending
process. These values correspond to the data available using SOCK_PASSCRED in
the SCM_CREDENTIALS cmsg.
The possible values in the status word are:
- NL_MMAP_STATUS_UNUSED:
RX ring: frame belongs to the kernel and contains no message
for user-space. Approriate action is to invoke poll()
to wait for new messages.
TX ring: frame belongs to user-space and can be used for
message construction.
- NL_MMAP_STATUS_RESERVED:
RX ring only: frame is currently used by the kernel for message
construction and contains no valid message yet.
Appropriate action is to invoke poll() to wait for
new messages.
- NL_MMAP_STATUS_VALID:
RX ring: frame contains a valid message. Approriate action is
to process the message and release the frame back to
the kernel by setting the status to
NL_MMAP_STATUS_UNUSED or queue the frame by setting the
status to NL_MMAP_STATUS_SKIP.
TX ring: the frame contains a valid message from user-space to
be processed by the kernel. After completing processing
the kernel will release the frame back to user-space by
setting the status to NL_MMAP_STATUS_UNUSED.
- NL_MMAP_STATUS_COPY:
RX ring only: a message is ready to be processed but could not be
stored in the ring, either because it exceeded the
frame size or because the originating subsystem does
not support memory mapped I/O. Appropriate action is
to invoke recvmsg() to receive the message and release
the frame back to the kernel by setting the status to
NL_MMAP_STATUS_UNUSED.
- NL_MMAP_STATUS_SKIP:
RX ring only: user-space queued the message for later processing, but
processed some messages following it in the ring. The
kernel should skip this frame when looking for unused
frames.
The data area of a frame begins at a offset of NL_MMAP_HDRLEN relative to the
frame header.
TX limitations
--------------
Kernel processing usually involves validation of the message received by
user-space, then processing its contents. The kernel must assure that
userspace is not able to modify the message contents after they have been
validated. In order to do so, the message is copied from the ring frame
to an allocated buffer if either of these conditions is false:
- only a single mapping of the ring exists
- the file descriptor is not shared between processes
This means that for threaded programs, the kernel will fall back to copying.
Example
-------
Ring setup:
unsigned int block_size = 16 * getpagesize();
struct nl_mmap_req req = {
.nm_block_size = block_size,
.nm_block_nr = 64,
.nm_frame_size = 16384,
.nm_frame_nr = 64 * block_size / 16384,
};
unsigned int ring_size;
void *rx_ring, *tx_ring;
/* Configure ring parameters */
if (setsockopt(fd, NETLINK_RX_RING, &req, sizeof(req)) < 0)
exit(1);
if (setsockopt(fd, NETLINK_TX_RING, &req, sizeof(req)) < 0)
exit(1)
/* Calculate size of each invididual ring */
ring_size = req.nm_block_nr * req.nm_block_size;
/* Map RX/TX rings. The TX ring is located after the RX ring */
rx_ring = mmap(NULL, 2 * ring_size, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
if ((long)rx_ring == -1L)
exit(1);
tx_ring = rx_ring + ring_size:
Message reception:
This example assumes some ring parameters of the ring setup are available.
unsigned int frame_offset = 0;
struct nl_mmap_hdr *hdr;
struct nlmsghdr *nlh;
unsigned char buf[16384];
ssize_t len;
while (1) {
struct pollfd pfds[1];
pfds[0].fd = fd;
pfds[0].events = POLLIN | POLLERR;
pfds[0].revents = 0;
if (poll(pfds, 1, -1) < 0 && errno != -EINTR)
exit(1);
/* Check for errors. Error handling omitted */
if (pfds[0].revents & POLLERR)
<handle error>
/* If no new messages, poll again */
if (!(pfds[0].revents & POLLIN))
continue;
/* Process all frames */
while (1) {
/* Get next frame header */
hdr = rx_ring + frame_offset;
if (hdr->nm_status == NL_MMAP_STATUS_VALID)
/* Regular memory mapped frame */
nlh = (void *hdr) + NL_MMAP_HDRLEN;
len = hdr->nm_len;
/* Release empty message immediately. May happen
* on error during message construction.
*/
if (len == 0)
goto release;
} else if (hdr->nm_status == NL_MMAP_STATUS_COPY) {
/* Frame queued to socket receive queue */
len = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
if (len <= 0)
break;
nlh = buf;
} else
/* No more messages to process, continue polling */
break;
process_msg(nlh);
release:
/* Release frame back to the kernel */
hdr->nm_status = NL_MMAP_STATUS_UNUSED;
/* Advance frame offset to next frame */
frame_offset = (frame_offset + frame_size) % ring_size;
}
}
Message transmission:
This example assumes some ring parameters of the ring setup are available.
A single message is constructed and transmitted, to send multiple messages
at once they would be constructed in consecutive frames before a final call
to sendto().
unsigned int frame_offset = 0;
struct nl_mmap_hdr *hdr;
struct nlmsghdr *nlh;
struct sockaddr_nl addr = {
.nl_family = AF_NETLINK,
};
hdr = tx_ring + frame_offset;
if (hdr->nm_status != NL_MMAP_STATUS_UNUSED)
/* No frame available. Use poll() to avoid. */
exit(1);
nlh = (void *)hdr + NL_MMAP_HDRLEN;
/* Build message */
build_message(nlh);
/* Fill frame header: length and status need to be set */
hdr->nm_len = nlh->nlmsg_len;
hdr->nm_status = NL_MMAP_STATUS_VALID;
if (sendto(fd, NULL, 0, 0, &addr, sizeof(addr)) < 0)
exit(1);
/* Advance frame offset to next frame */
frame_offset = (frame_offset + frame_size) % ring_size;

View File

@ -29,10 +29,13 @@ extern int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n);
extern int nfnetlink_subsys_unregister(const struct nfnetlink_subsystem *n); extern int nfnetlink_subsys_unregister(const struct nfnetlink_subsystem *n);
extern int nfnetlink_has_listeners(struct net *net, unsigned int group); extern int nfnetlink_has_listeners(struct net *net, unsigned int group);
extern int nfnetlink_send(struct sk_buff *skb, struct net *net, u32 pid, unsigned int group, extern struct sk_buff *nfnetlink_alloc_skb(struct net *net, unsigned int size,
int echo, gfp_t flags); u32 dst_portid, gfp_t gfp_mask);
extern int nfnetlink_set_err(struct net *net, u32 pid, u32 group, int error); extern int nfnetlink_send(struct sk_buff *skb, struct net *net, u32 portid,
extern int nfnetlink_unicast(struct sk_buff *skb, struct net *net, u_int32_t pid, int flags); unsigned int group, int echo, gfp_t flags);
extern int nfnetlink_set_err(struct net *net, u32 portid, u32 group, int error);
extern int nfnetlink_unicast(struct sk_buff *skb, struct net *net,
u32 portid, int flags);
extern void nfnl_lock(__u8 subsys_id); extern void nfnl_lock(__u8 subsys_id);
extern void nfnl_unlock(__u8 subsys_id); extern void nfnl_unlock(__u8 subsys_id);

View File

@ -15,11 +15,18 @@ static inline struct nlmsghdr *nlmsg_hdr(const struct sk_buff *skb)
return (struct nlmsghdr *)skb->data; return (struct nlmsghdr *)skb->data;
} }
enum netlink_skb_flags {
NETLINK_SKB_MMAPED = 0x1, /* Packet data is mmaped */
NETLINK_SKB_TX = 0x2, /* Packet was sent by userspace */
NETLINK_SKB_DELIVERED = 0x4, /* Packet was delivered */
};
struct netlink_skb_parms { struct netlink_skb_parms {
struct scm_creds creds; /* Skb credentials */ struct scm_creds creds; /* Skb credentials */
__u32 portid; __u32 portid;
__u32 dst_group; __u32 dst_group;
struct sock *ssk; __u32 flags;
struct sock *sk;
}; };
#define NETLINK_CB(skb) (*(struct netlink_skb_parms*)&((skb)->cb)) #define NETLINK_CB(skb) (*(struct netlink_skb_parms*)&((skb)->cb))
@ -57,6 +64,8 @@ extern void __netlink_clear_multicast_users(struct sock *sk, unsigned int group)
extern void netlink_clear_multicast_users(struct sock *sk, unsigned int group); extern void netlink_clear_multicast_users(struct sock *sk, unsigned int group);
extern void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err); extern void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err);
extern int netlink_has_listeners(struct sock *sk, unsigned int group); extern int netlink_has_listeners(struct sock *sk, unsigned int group);
extern struct sk_buff *netlink_alloc_skb(struct sock *ssk, unsigned int size,
u32 dst_portid, gfp_t gfp_mask);
extern int netlink_unicast(struct sock *ssk, struct sk_buff *skb, __u32 portid, int nonblock); extern int netlink_unicast(struct sock *ssk, struct sk_buff *skb, __u32 portid, int nonblock);
extern int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, __u32 portid, extern int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, __u32 portid,
__u32 group, gfp_t allocation); __u32 group, gfp_t allocation);

View File

@ -651,6 +651,12 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE); return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
} }
extern struct sk_buff *__alloc_skb_head(gfp_t priority, int node);
static inline struct sk_buff *alloc_skb_head(gfp_t priority)
{
return __alloc_skb_head(priority, -1);
}
extern struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src); extern struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
extern int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask); extern int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
extern struct sk_buff *skb_clone(struct sk_buff *skb, extern struct sk_buff *skb_clone(struct sk_buff *skb,

View File

@ -184,7 +184,7 @@ extern int nf_conntrack_hash_check_insert(struct nf_conn *ct);
extern void nf_ct_delete_from_lists(struct nf_conn *ct); extern void nf_ct_delete_from_lists(struct nf_conn *ct);
extern void nf_ct_dying_timeout(struct nf_conn *ct); extern void nf_ct_dying_timeout(struct nf_conn *ct);
extern void nf_conntrack_flush_report(struct net *net, u32 pid, int report); extern void nf_conntrack_flush_report(struct net *net, u32 portid, int report);
extern bool nf_ct_get_tuplepr(const struct sk_buff *skb, extern bool nf_ct_get_tuplepr(const struct sk_buff *skb,
unsigned int nhoff, u_int16_t l3num, unsigned int nhoff, u_int16_t l3num,

View File

@ -88,7 +88,7 @@ nf_ct_find_expectation(struct net *net, u16 zone,
const struct nf_conntrack_tuple *tuple); const struct nf_conntrack_tuple *tuple);
void nf_ct_unlink_expect_report(struct nf_conntrack_expect *exp, void nf_ct_unlink_expect_report(struct nf_conntrack_expect *exp,
u32 pid, int report); u32 portid, int report);
static inline void nf_ct_unlink_expect(struct nf_conntrack_expect *exp) static inline void nf_ct_unlink_expect(struct nf_conntrack_expect *exp)
{ {
nf_ct_unlink_expect_report(exp, 0, 0); nf_ct_unlink_expect_report(exp, 0, 0);
@ -106,7 +106,7 @@ void nf_ct_expect_init(struct nf_conntrack_expect *, unsigned int, u_int8_t,
u_int8_t, const __be16 *, const __be16 *); u_int8_t, const __be16 *, const __be16 *);
void nf_ct_expect_put(struct nf_conntrack_expect *exp); void nf_ct_expect_put(struct nf_conntrack_expect *exp);
int nf_ct_expect_related_report(struct nf_conntrack_expect *expect, int nf_ct_expect_related_report(struct nf_conntrack_expect *expect,
u32 pid, int report); u32 portid, int report);
static inline int nf_ct_expect_related(struct nf_conntrack_expect *expect) static inline int nf_ct_expect_related(struct nf_conntrack_expect *expect)
{ {
return nf_ct_expect_related_report(expect, 0, 0); return nf_ct_expect_related_report(expect, 0, 0);

View File

@ -1,6 +1,7 @@
#ifndef _UAPI__LINUX_NETLINK_H #ifndef _UAPI__LINUX_NETLINK_H
#define _UAPI__LINUX_NETLINK_H #define _UAPI__LINUX_NETLINK_H
#include <linux/kernel.h>
#include <linux/socket.h> /* for __kernel_sa_family_t */ #include <linux/socket.h> /* for __kernel_sa_family_t */
#include <linux/types.h> #include <linux/types.h>
@ -105,11 +106,42 @@ struct nlmsgerr {
#define NETLINK_PKTINFO 3 #define NETLINK_PKTINFO 3
#define NETLINK_BROADCAST_ERROR 4 #define NETLINK_BROADCAST_ERROR 4
#define NETLINK_NO_ENOBUFS 5 #define NETLINK_NO_ENOBUFS 5
#define NETLINK_RX_RING 6
#define NETLINK_TX_RING 7
struct nl_pktinfo { struct nl_pktinfo {
__u32 group; __u32 group;
}; };
struct nl_mmap_req {
unsigned int nm_block_size;
unsigned int nm_block_nr;
unsigned int nm_frame_size;
unsigned int nm_frame_nr;
};
struct nl_mmap_hdr {
unsigned int nm_status;
unsigned int nm_len;
__u32 nm_group;
/* credentials */
__u32 nm_pid;
__u32 nm_uid;
__u32 nm_gid;
};
enum nl_mmap_status {
NL_MMAP_STATUS_UNUSED,
NL_MMAP_STATUS_RESERVED,
NL_MMAP_STATUS_VALID,
NL_MMAP_STATUS_COPY,
NL_MMAP_STATUS_SKIP,
};
#define NL_MMAP_MSG_ALIGNMENT NLMSG_ALIGNTO
#define NL_MMAP_MSG_ALIGN(sz) __ALIGN_KERNEL(sz, NL_MMAP_MSG_ALIGNMENT)
#define NL_MMAP_HDRLEN NL_MMAP_MSG_ALIGN(sizeof(struct nl_mmap_hdr))
#define NET_MAJOR 36 /* Major 36 is reserved for networking */ #define NET_MAJOR 36 /* Major 36 is reserved for networking */
enum { enum {

View File

@ -25,9 +25,18 @@ struct netlink_diag_msg {
__u32 ndiag_cookie[2]; __u32 ndiag_cookie[2];
}; };
struct netlink_diag_ring {
__u32 ndr_block_size;
__u32 ndr_block_nr;
__u32 ndr_frame_size;
__u32 ndr_frame_nr;
};
enum { enum {
NETLINK_DIAG_MEMINFO, NETLINK_DIAG_MEMINFO,
NETLINK_DIAG_GROUPS, NETLINK_DIAG_GROUPS,
NETLINK_DIAG_RX_RING,
NETLINK_DIAG_TX_RING,
__NETLINK_DIAG_MAX, __NETLINK_DIAG_MAX,
}; };
@ -38,5 +47,6 @@ enum {
#define NDIAG_SHOW_MEMINFO 0x00000001 /* show memory info of a socket */ #define NDIAG_SHOW_MEMINFO 0x00000001 /* show memory info of a socket */
#define NDIAG_SHOW_GROUPS 0x00000002 /* show groups of a netlink socket */ #define NDIAG_SHOW_GROUPS 0x00000002 /* show groups of a netlink socket */
#define NDIAG_SHOW_RING_CFG 0x00000004 /* show ring configuration */
#endif #endif

View File

@ -23,6 +23,15 @@ menuconfig NET
if NET if NET
config NETLINK_MMAP
bool "Netlink: mmaped IO"
help
This option enables support for memory mapped netlink IO. This
reduces overhead by avoiding copying data between kernel- and
userspace.
If unsure, say N.
config WANT_COMPAT_NETLINK_MESSAGES config WANT_COMPAT_NETLINK_MESSAGES
bool bool
help help

View File

@ -179,6 +179,33 @@ out:
* *
*/ */
struct sk_buff *__alloc_skb_head(gfp_t gfp_mask, int node)
{
struct sk_buff *skb;
/* Get the HEAD */
skb = kmem_cache_alloc_node(skbuff_head_cache,
gfp_mask & ~__GFP_DMA, node);
if (!skb)
goto out;
/*
* Only clear those fields we need to clear, not those that we will
* actually initialise below. Hence, don't put any more fields after
* the tail pointer in struct sk_buff!
*/
memset(skb, 0, offsetof(struct sk_buff, tail));
skb->data = NULL;
skb->truesize = sizeof(struct sk_buff);
atomic_set(&skb->users, 1);
#ifdef NET_SKBUFF_DATA_USES_OFFSET
skb->mac_header = ~0U;
#endif
out:
return skb;
}
/** /**
* __alloc_skb - allocate a network buffer * __alloc_skb - allocate a network buffer
* @size: size to allocate * @size: size to allocate
@ -584,7 +611,8 @@ static void skb_release_head_state(struct sk_buff *skb)
static void skb_release_all(struct sk_buff *skb) static void skb_release_all(struct sk_buff *skb)
{ {
skb_release_head_state(skb); skb_release_head_state(skb);
skb_release_data(skb); if (likely(skb->data))
skb_release_data(skb);
} }
/** /**

View File

@ -324,7 +324,7 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo, struct sk_buff *in_s
} }
err = sk_diag_fill(sk, rep, req, err = sk_diag_fill(sk, rep, req,
sk_user_ns(NETLINK_CB(in_skb).ssk), sk_user_ns(NETLINK_CB(in_skb).sk),
NETLINK_CB(in_skb).portid, NETLINK_CB(in_skb).portid,
nlh->nlmsg_seq, 0, nlh); nlh->nlmsg_seq, 0, nlh);
if (err < 0) { if (err < 0) {
@ -630,7 +630,7 @@ static int inet_csk_diag_dump(struct sock *sk,
return 0; return 0;
return inet_csk_diag_fill(sk, skb, r, return inet_csk_diag_fill(sk, skb, r,
sk_user_ns(NETLINK_CB(cb->skb).ssk), sk_user_ns(NETLINK_CB(cb->skb).sk),
NETLINK_CB(cb->skb).portid, NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq, NLM_F_MULTI, cb->nlh); cb->nlh->nlmsg_seq, NLM_F_MULTI, cb->nlh);
} }
@ -805,7 +805,7 @@ static int inet_diag_dump_reqs(struct sk_buff *skb, struct sock *sk,
} }
err = inet_diag_fill_req(skb, sk, req, err = inet_diag_fill_req(skb, sk, req,
sk_user_ns(NETLINK_CB(cb->skb).ssk), sk_user_ns(NETLINK_CB(cb->skb).sk),
NETLINK_CB(cb->skb).portid, NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq, cb->nlh); cb->nlh->nlmsg_seq, cb->nlh);
if (err < 0) { if (err < 0) {

View File

@ -25,7 +25,7 @@ static int sk_diag_dump(struct sock *sk, struct sk_buff *skb,
return 0; return 0;
return inet_sk_diag_fill(sk, NULL, skb, req, return inet_sk_diag_fill(sk, NULL, skb, req,
sk_user_ns(NETLINK_CB(cb->skb).ssk), sk_user_ns(NETLINK_CB(cb->skb).sk),
NETLINK_CB(cb->skb).portid, NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq, NLM_F_MULTI, cb->nlh); cb->nlh->nlmsg_seq, NLM_F_MULTI, cb->nlh);
} }
@ -71,7 +71,7 @@ static int udp_dump_one(struct udp_table *tbl, struct sk_buff *in_skb,
goto out; goto out;
err = inet_sk_diag_fill(sk, NULL, rep, req, err = inet_sk_diag_fill(sk, NULL, rep, req,
sk_user_ns(NETLINK_CB(in_skb).ssk), sk_user_ns(NETLINK_CB(in_skb).sk),
NETLINK_CB(in_skb).portid, NETLINK_CB(in_skb).portid,
nlh->nlmsg_seq, 0, nlh); nlh->nlmsg_seq, 0, nlh);
if (err < 0) { if (err < 0) {

View File

@ -1260,7 +1260,7 @@ void nf_ct_iterate_cleanup(struct net *net,
EXPORT_SYMBOL_GPL(nf_ct_iterate_cleanup); EXPORT_SYMBOL_GPL(nf_ct_iterate_cleanup);
struct __nf_ct_flush_report { struct __nf_ct_flush_report {
u32 pid; u32 portid;
int report; int report;
}; };
@ -1275,7 +1275,7 @@ static int kill_report(struct nf_conn *i, void *data)
/* If we fail to deliver the event, death_by_timeout() will retry */ /* If we fail to deliver the event, death_by_timeout() will retry */
if (nf_conntrack_event_report(IPCT_DESTROY, i, if (nf_conntrack_event_report(IPCT_DESTROY, i,
fr->pid, fr->report) < 0) fr->portid, fr->report) < 0)
return 1; return 1;
/* Avoid the delivery of the destroy event in death_by_timeout(). */ /* Avoid the delivery of the destroy event in death_by_timeout(). */
@ -1298,10 +1298,10 @@ void nf_ct_free_hashtable(void *hash, unsigned int size)
} }
EXPORT_SYMBOL_GPL(nf_ct_free_hashtable); EXPORT_SYMBOL_GPL(nf_ct_free_hashtable);
void nf_conntrack_flush_report(struct net *net, u32 pid, int report) void nf_conntrack_flush_report(struct net *net, u32 portid, int report)
{ {
struct __nf_ct_flush_report fr = { struct __nf_ct_flush_report fr = {
.pid = pid, .portid = portid,
.report = report, .report = report,
}; };
nf_ct_iterate_cleanup(net, kill_report, &fr); nf_ct_iterate_cleanup(net, kill_report, &fr);

View File

@ -40,7 +40,7 @@ static struct kmem_cache *nf_ct_expect_cachep __read_mostly;
/* nf_conntrack_expect helper functions */ /* nf_conntrack_expect helper functions */
void nf_ct_unlink_expect_report(struct nf_conntrack_expect *exp, void nf_ct_unlink_expect_report(struct nf_conntrack_expect *exp,
u32 pid, int report) u32 portid, int report)
{ {
struct nf_conn_help *master_help = nfct_help(exp->master); struct nf_conn_help *master_help = nfct_help(exp->master);
struct net *net = nf_ct_exp_net(exp); struct net *net = nf_ct_exp_net(exp);
@ -54,7 +54,7 @@ void nf_ct_unlink_expect_report(struct nf_conntrack_expect *exp,
hlist_del(&exp->lnode); hlist_del(&exp->lnode);
master_help->expecting[exp->class]--; master_help->expecting[exp->class]--;
nf_ct_expect_event_report(IPEXP_DESTROY, exp, pid, report); nf_ct_expect_event_report(IPEXP_DESTROY, exp, portid, report);
nf_ct_expect_put(exp); nf_ct_expect_put(exp);
NF_CT_STAT_INC(net, expect_delete); NF_CT_STAT_INC(net, expect_delete);
@ -412,7 +412,7 @@ out:
} }
int nf_ct_expect_related_report(struct nf_conntrack_expect *expect, int nf_ct_expect_related_report(struct nf_conntrack_expect *expect,
u32 pid, int report) u32 portid, int report)
{ {
int ret; int ret;
@ -425,7 +425,7 @@ int nf_ct_expect_related_report(struct nf_conntrack_expect *expect,
if (ret < 0) if (ret < 0)
goto out; goto out;
spin_unlock_bh(&nf_conntrack_lock); spin_unlock_bh(&nf_conntrack_lock);
nf_ct_expect_event_report(IPEXP_NEW, expect, pid, report); nf_ct_expect_event_report(IPEXP_NEW, expect, portid, report);
return ret; return ret;
out: out:
spin_unlock_bh(&nf_conntrack_lock); spin_unlock_bh(&nf_conntrack_lock);

View File

@ -112,22 +112,30 @@ int nfnetlink_has_listeners(struct net *net, unsigned int group)
} }
EXPORT_SYMBOL_GPL(nfnetlink_has_listeners); EXPORT_SYMBOL_GPL(nfnetlink_has_listeners);
int nfnetlink_send(struct sk_buff *skb, struct net *net, u32 pid, struct sk_buff *nfnetlink_alloc_skb(struct net *net, unsigned int size,
u32 dst_portid, gfp_t gfp_mask)
{
return netlink_alloc_skb(net->nfnl, size, dst_portid, gfp_mask);
}
EXPORT_SYMBOL_GPL(nfnetlink_alloc_skb);
int nfnetlink_send(struct sk_buff *skb, struct net *net, u32 portid,
unsigned int group, int echo, gfp_t flags) unsigned int group, int echo, gfp_t flags)
{ {
return nlmsg_notify(net->nfnl, skb, pid, group, echo, flags); return nlmsg_notify(net->nfnl, skb, portid, group, echo, flags);
} }
EXPORT_SYMBOL_GPL(nfnetlink_send); EXPORT_SYMBOL_GPL(nfnetlink_send);
int nfnetlink_set_err(struct net *net, u32 pid, u32 group, int error) int nfnetlink_set_err(struct net *net, u32 portid, u32 group, int error)
{ {
return netlink_set_err(net->nfnl, pid, group, error); return netlink_set_err(net->nfnl, portid, group, error);
} }
EXPORT_SYMBOL_GPL(nfnetlink_set_err); EXPORT_SYMBOL_GPL(nfnetlink_set_err);
int nfnetlink_unicast(struct sk_buff *skb, struct net *net, u_int32_t pid, int flags) int nfnetlink_unicast(struct sk_buff *skb, struct net *net, u32 portid,
int flags)
{ {
return netlink_unicast(net->nfnl, skb, pid, flags); return netlink_unicast(net->nfnl, skb, portid, flags);
} }
EXPORT_SYMBOL_GPL(nfnetlink_unicast); EXPORT_SYMBOL_GPL(nfnetlink_unicast);

View File

@ -318,7 +318,7 @@ nfulnl_set_flags(struct nfulnl_instance *inst, u_int16_t flags)
} }
static struct sk_buff * static struct sk_buff *
nfulnl_alloc_skb(unsigned int inst_size, unsigned int pkt_size) nfulnl_alloc_skb(u32 peer_portid, unsigned int inst_size, unsigned int pkt_size)
{ {
struct sk_buff *skb; struct sk_buff *skb;
unsigned int n; unsigned int n;
@ -327,13 +327,14 @@ nfulnl_alloc_skb(unsigned int inst_size, unsigned int pkt_size)
* message. WARNING: has to be <= 128k due to slab restrictions */ * message. WARNING: has to be <= 128k due to slab restrictions */
n = max(inst_size, pkt_size); n = max(inst_size, pkt_size);
skb = alloc_skb(n, GFP_ATOMIC); skb = nfnetlink_alloc_skb(&init_net, n, peer_portid, GFP_ATOMIC);
if (!skb) { if (!skb) {
if (n > pkt_size) { if (n > pkt_size) {
/* try to allocate only as much as we need for current /* try to allocate only as much as we need for current
* packet */ * packet */
skb = alloc_skb(pkt_size, GFP_ATOMIC); skb = nfnetlink_alloc_skb(&init_net, pkt_size,
peer_portid, GFP_ATOMIC);
if (!skb) if (!skb)
pr_err("nfnetlink_log: can't even alloc %u bytes\n", pr_err("nfnetlink_log: can't even alloc %u bytes\n",
pkt_size); pkt_size);
@ -696,7 +697,8 @@ nfulnl_log_packet(u_int8_t pf,
} }
if (!inst->skb) { if (!inst->skb) {
inst->skb = nfulnl_alloc_skb(inst->nlbufsiz, size); inst->skb = nfulnl_alloc_skb(inst->peer_portid, inst->nlbufsiz,
size);
if (!inst->skb) if (!inst->skb)
goto alloc_failure; goto alloc_failure;
} }
@ -824,7 +826,7 @@ nfulnl_recv_config(struct sock *ctnl, struct sk_buff *skb,
inst = instance_create(net, group_num, inst = instance_create(net, group_num,
NETLINK_CB(skb).portid, NETLINK_CB(skb).portid,
sk_user_ns(NETLINK_CB(skb).ssk)); sk_user_ns(NETLINK_CB(skb).sk));
if (IS_ERR(inst)) { if (IS_ERR(inst)) {
ret = PTR_ERR(inst); ret = PTR_ERR(inst);
goto out; goto out;

View File

@ -339,7 +339,8 @@ nfqnl_build_packet_message(struct nfqnl_instance *queue,
if (queue->flags & NFQA_CFG_F_CONNTRACK) if (queue->flags & NFQA_CFG_F_CONNTRACK)
ct = nfqnl_ct_get(entskb, &size, &ctinfo); ct = nfqnl_ct_get(entskb, &size, &ctinfo);
skb = alloc_skb(size, GFP_ATOMIC); skb = nfnetlink_alloc_skb(&init_net, size, queue->peer_portid,
GFP_ATOMIC);
if (!skb) if (!skb)
return NULL; return NULL;

File diff suppressed because it is too large Load Diff

View File

@ -6,6 +6,20 @@
#define NLGRPSZ(x) (ALIGN(x, sizeof(unsigned long) * 8) / 8) #define NLGRPSZ(x) (ALIGN(x, sizeof(unsigned long) * 8) / 8)
#define NLGRPLONGS(x) (NLGRPSZ(x)/sizeof(unsigned long)) #define NLGRPLONGS(x) (NLGRPSZ(x)/sizeof(unsigned long))
struct netlink_ring {
void **pg_vec;
unsigned int head;
unsigned int frames_per_block;
unsigned int frame_size;
unsigned int frame_max;
unsigned int pg_vec_order;
unsigned int pg_vec_pages;
unsigned int pg_vec_len;
atomic_t pending;
};
struct netlink_sock { struct netlink_sock {
/* struct sock has to be the first member of netlink_sock */ /* struct sock has to be the first member of netlink_sock */
struct sock sk; struct sock sk;
@ -24,6 +38,12 @@ struct netlink_sock {
void (*netlink_rcv)(struct sk_buff *skb); void (*netlink_rcv)(struct sk_buff *skb);
void (*netlink_bind)(int group); void (*netlink_bind)(int group);
struct module *module; struct module *module;
#ifdef CONFIG_NETLINK_MMAP
struct mutex pg_vec_lock;
struct netlink_ring rx_ring;
struct netlink_ring tx_ring;
atomic_t mapped;
#endif /* CONFIG_NETLINK_MMAP */
}; };
static inline struct netlink_sock *nlk_sk(struct sock *sk) static inline struct netlink_sock *nlk_sk(struct sock *sk)

View File

@ -7,6 +7,34 @@
#include "af_netlink.h" #include "af_netlink.h"
static int sk_diag_put_ring(struct netlink_ring *ring, int nl_type,
struct sk_buff *nlskb)
{
struct netlink_diag_ring ndr;
ndr.ndr_block_size = ring->pg_vec_pages << PAGE_SHIFT;
ndr.ndr_block_nr = ring->pg_vec_len;
ndr.ndr_frame_size = ring->frame_size;
ndr.ndr_frame_nr = ring->frame_max + 1;
return nla_put(nlskb, nl_type, sizeof(ndr), &ndr);
}
static int sk_diag_put_rings_cfg(struct sock *sk, struct sk_buff *nlskb)
{
struct netlink_sock *nlk = nlk_sk(sk);
int ret;
mutex_lock(&nlk->pg_vec_lock);
ret = sk_diag_put_ring(&nlk->rx_ring, NETLINK_DIAG_RX_RING, nlskb);
if (!ret)
ret = sk_diag_put_ring(&nlk->tx_ring, NETLINK_DIAG_TX_RING,
nlskb);
mutex_unlock(&nlk->pg_vec_lock);
return ret;
}
static int sk_diag_dump_groups(struct sock *sk, struct sk_buff *nlskb) static int sk_diag_dump_groups(struct sock *sk, struct sk_buff *nlskb)
{ {
struct netlink_sock *nlk = nlk_sk(sk); struct netlink_sock *nlk = nlk_sk(sk);
@ -51,6 +79,10 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
sock_diag_put_meminfo(sk, skb, NETLINK_DIAG_MEMINFO)) sock_diag_put_meminfo(sk, skb, NETLINK_DIAG_MEMINFO))
goto out_nlmsg_trim; goto out_nlmsg_trim;
if ((req->ndiag_show & NDIAG_SHOW_RING_CFG) &&
sk_diag_put_rings_cfg(sk, skb))
goto out_nlmsg_trim;
return nlmsg_end(skb, nlh); return nlmsg_end(skb, nlh);
out_nlmsg_trim: out_nlmsg_trim:

View File

@ -393,7 +393,7 @@ static int flow_change(struct net *net, struct sk_buff *in_skb,
return -EOPNOTSUPP; return -EOPNOTSUPP;
if ((keymask & (FLOW_KEY_SKUID|FLOW_KEY_SKGID)) && if ((keymask & (FLOW_KEY_SKUID|FLOW_KEY_SKGID)) &&
sk_user_ns(NETLINK_CB(in_skb).ssk) != &init_user_ns) sk_user_ns(NETLINK_CB(in_skb).sk) != &init_user_ns)
return -EOPNOTSUPP; return -EOPNOTSUPP;
} }