linux/net
Florian Westphal d1b4c689d4 netlink: remove mmapped netlink support
mmapped netlink has a number of unresolved issues:

- TX zerocopy support had to be disabled more than a year ago via
  commit 4682a03586 ("netlink: Always copy on mmap TX.")
  because the content of the mmapped area can change after netlink
  attribute validation but before message processing.

- RX support was implemented mainly to speed up nfqueue dumping packet
  payload to userspace.  However, since commit ae08ce0021
  ("netfilter: nfnetlink_queue: zero copy support") we avoid one copy
  with the socket-based interface too (via the skb_zerocopy helper).

The other problem is that skbs attached to an mmapped netlink socket
behave differently from normal skbs:

- they don't have a shinfo area, so all functions that use skb_shinfo()
(e.g. skb_clone) cannot be used.

- reserving headroom prevents userspace from seeing the content, because
userspace expects the message to start at skb->head.
See for instance
commit aa3a022094 ("netlink: not trim skb for mmaped socket when dump").

- skbs handed e.g. to netlink_ack must have non-NULL skb->sk, else we
crash because it needs the sk to check if a tx ring is attached.

This requirement is also not obvious and leads to non-intuitive bug fixes such as 7c7bdf359
("netfilter: nfnetlink: use original skbuff when acking batches").

mmaped netlink also didn't play nicely with the skb_zerocopy helper
used by nfqueue and openvswitch.  Daniel Borkmann fixed this via
commit 6bb0fef489 ("netlink, mmap: fix edge-case leakages in nf queue
zero-copy"), but at the cost of also needing to provide remaining
length to the allocation function.

nfqueue also has problems when used with mmaped rx netlink:
- mmaped netlink doesn't allow use of nfqueue batch verdict messages.
  The problem is that in the mmap case, the allocation time also determines
  the ordering in which the frame will be seen by userspace (A
  allocating before B means that A is located in an earlier ring slot),
  but this also means that B might get a lower sequence number than A
  since seqno is decided later.  To fix this we would need to extend the
  spinlocked region to also cover the allocation and message setup, which
  isn't desirable.
- nfqueue can now be configured to queue large (GSO) skbs to userspace.
  Queueing GSO packets is faster than having to force a software segmentation
  in the kernel, so this is a desirable option.  However, with a mmap based
  ring one has to use 64kb per ring slot element, else mmap has to fall back
  to the socket path (NL_MMAP_STATUS_COPY) for all large packets.

To use the mmap interface, userspace not only has to probe for mmap netlink
support, it also has to implement a recv/socket receive path in order to
handle messages that exceed the size of an rx ring element.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 11:42:18 -05:00
..
6lowpan 6lowpan: fix debugfs interface entry name 2015-12-20 08:21:00 +01:00
9p Rework and error handling fixes, primarily in the fscatch and fd transports. 2016-01-24 12:39:09 -08:00
802
8021q vlan: change return type of vlan_proc_rem_dev 2016-02-17 22:06:28 -05:00
appletalk
atm net: Generalise wq_has_sleeper helper 2015-11-30 14:47:33 -05:00
ax25 net: add validation for the socket syscall protocol argument 2015-12-14 16:09:30 -05:00
batman-adv batman-adv: Convert batadv_tt_common_entry to kref 2016-02-10 23:24:06 +08:00
bluetooth Bluetooth: Fix incorrect removing of IRKs 2016-01-29 11:47:24 +01:00
bridge bridge: switchdev: Offload VLAN flags to hardware bridge 2016-02-18 11:18:11 -05:00
caif net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA 2015-12-01 15:45:05 -05:00
can can: avoid using timeval for uapi 2015-10-13 17:42:34 +02:00
ceph libceph: remove outdated comment 2016-01-21 19:36:09 +01:00
core net-sysfs: remove unused fmt_long_hex 2016-02-18 10:01:15 -05:00
dcb net/dcb: make dcbnl.c explicitly non-modular 2015-10-09 07:52:27 -07:00
dccp inet: refactor inet[6]_lookup functions to take skb 2016-02-11 03:54:14 -05:00
decnet net: add validation for the socket syscall protocol argument 2015-12-14 16:09:30 -05:00
dns_resolver net: dns_resolver: convert time_t to time64_t 2015-11-18 16:27:46 -05:00
dsa dsa: Register netdev before phy 2016-01-07 14:31:26 -05:00
ethernet net: Add eth_platform_get_mac_address() helper. 2016-01-06 16:31:56 -05:00
hsr net/hsr: fix a warning message 2015-11-23 14:56:15 -05:00
ieee802154 sock: struct proto hash function may error 2016-02-11 03:54:14 -05:00
ipv4 ipv4: Remove inet_lro library 2016-02-17 16:15:46 -05:00
ipv6 ipv4: namespacify ip_early_demux sysctl knob 2016-02-16 20:42:54 -05:00
ipx
irda irda: fix a potential use-after-free in ircomm_param_request 2016-01-29 22:56:46 -08:00
iucv af_iucv: Validate socket address length in iucv_sock_bind() 2016-01-19 14:21:08 -05:00
key af_key: fix two typos 2015-10-23 03:05:19 -07:00
l2tp inet: create IPv6-equivalent inet_hash function 2016-02-11 03:54:14 -05:00
l3mdev net: Add netif_is_l3_slave 2015-10-07 04:27:43 -07:00
lapb
llc af_llc: fix types on llc_ui_wait_for_conn 2016-02-17 16:12:13 -05:00
mac80211 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-02-01 15:56:08 -08:00
mac802154 mac802154: constify ieee802154_llsec_ops structure 2016-01-04 20:40:41 +01:00
mpls Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-12-17 22:08:28 -05:00
netfilter net: ip_tunnel: remove 'csum_help' argument to iptunnel_handle_offloads 2016-02-12 05:52:16 -05:00
netlabel
netlink netlink: remove mmapped netlink support 2016-02-18 11:42:18 -05:00
netrom
nfc NFC 4.5 pull request 2016-01-04 21:48:15 -05:00
openvswitch net: add dst_cache to ovs vxlan lwtunnel 2016-02-16 20:21:48 -05:00
packet packet: tpacket_snd gso and checksum offload 2016-02-09 06:43:50 -05:00
phonet sock: struct proto hash function may error 2016-02-11 03:54:14 -05:00
rds rds: duplicate include net/tcp.h 2016-02-11 09:45:24 -05:00
rfkill rfkill: fix rfkill_fop_read wait_event usage 2016-01-26 11:32:05 +01:00
rose
rxrpc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2016-01-12 18:57:02 -08:00
sched net_sched: Improve readability of filter processing 2016-02-18 11:19:12 -05:00
sctp sctp: remove the unused sctp_datamsg_free() 2016-02-17 15:41:54 -05:00
sunrpc Initial roundup of 4.5 merge window patches 2016-01-23 18:45:06 -08:00
switchdev switchdev: Require RTNL mutex to be held when sending FDB notifications 2016-01-28 16:21:31 -08:00
tipc tipc: refactor node xmit and fix memory leaks 2016-02-16 15:58:40 -05:00
unix net: drop write-only stack variable 2016-02-07 14:06:26 -05:00
vmw_vsock Revert "Merge branch 'vsock-virtio'" 2015-12-08 21:55:49 -05:00
wimax
wireless regulatory: fix world regulatory domain data 2016-01-14 11:10:13 +01:00
x25
xfrm net: preserve IP control block during GSO segmentation 2016-01-15 14:35:24 -05:00
compat.c
Kconfig net: add dst_cache support 2016-02-16 20:21:48 -05:00
Makefile net: Introduce L3 Master device abstraction 2015-09-29 20:40:32 -07:00
socket.c kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
sysctl_net.c net: sysctl: fix a kmemleak warning 2015-10-23 06:22:08 -07:00