linux/net
Eric Dumazet d3836f21b0 net: allow skb->head to be a page fragment
skb->head is currently allocated from kmalloc(). This is convenient but
has the drawback that the data cannot be converted to a page fragment
if needed.

We have three spots where it hurts:

1) GRO aggregation

 When a linear skb must be appended to another skb, GRO uses the
frag_list fallback, which is very inefficient since we keep every
struct sk_buff around. So drivers that enable GRO but deliver linear
skbs to the network stack do not get the full benefit of GRO.

2) splice(socket -> pipe).

 We must copy the linear part to a page fragment.
 This rather defeats the purpose of splice() (its zero-copy claim).

3) TCP coalescing.

 Recently introduced, this allows several contiguous segments to be
grouped into a single skb. This shortens queue lengths, saves kernel
memory, and greatly reduces the probability of TCP collapses. This
coalescing doesn't work on linear skbs (we would need to copy the data,
which would be too slow).

Given all these issues, the following patch introduces the possibility
of having skb->head be a fragment in itself. We use a new skb flag,
skb->head_frag to carry this information.

build_skb() is changed to accept a frag_size argument. Drivers that
want to provide a page fragment instead of kmalloc() data pass a
non-zero value equal to the fragment size.
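
The frag_size convention described above can be modelled in a few lines
of userspace C. This is only a sketch, not the kernel implementation:
struct toy_skb, toy_build_skb() and the buf_size parameter are
hypothetical names invented for illustration, and buf_size stands in
for whatever size the allocator would report for a kmalloc()ed buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the few sk_buff fields discussed. */
struct toy_skb {
	unsigned char *head;	/* start of the data buffer */
	unsigned int size;	/* usable size of that buffer */
	bool head_frag;		/* true if head is a page fragment */
};

/*
 * Toy analogue of build_skb(data, frag_size).
 * frag_size == 0: data is treated as kmalloc()-style memory of buf_size bytes.
 * frag_size != 0: data is treated as a page fragment of frag_size bytes.
 * buf_size is an assumption of this sketch, not a kernel parameter.
 */
struct toy_skb *toy_build_skb(void *data, unsigned int buf_size,
			      unsigned int frag_size)
{
	struct toy_skb *skb;

	assert(data != NULL);

	skb = malloc(sizeof(*skb));
	if (!skb)
		return NULL;

	skb->head = data;
	skb->head_frag = frag_size != 0;
	skb->size = frag_size ? frag_size : buf_size;
	return skb;
}
```

In this model a caller passing frag_size == 0 gets the old behaviour,
so existing callers need no knowledge of fragments.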

Then, in situations where we need to convert the skb head to a frag in
itself, we can check whether skb->head_frag is set and avoid the copies
or various fallbacks we have.
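
The check described above might look roughly like the following
userspace sketch. It is not kernel code: struct toy_skb, struct
toy_frag and toy_head_to_frag() are hypothetical names, and malloc()
plus memcpy() merely stand in for the copy that the head_frag flag lets
us skip.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-ins, not the kernel structures. */
struct toy_skb {
	unsigned char *head;	/* linear data */
	unsigned int len;	/* bytes of linear data */
	bool head_frag;		/* true if head is a page fragment */
};

struct toy_frag {
	unsigned char *data;
	unsigned int len;
	bool copied;		/* true if data is a private copy */
};

/*
 * Turn the linear part of skb into a fragment.
 * If head_frag is set, the fragment refers to the head in place.
 * Otherwise fall back to copying, which is the cost being avoided.
 * Returns 0 on success, -1 if the fallback allocation fails.
 */
int toy_head_to_frag(const struct toy_skb *skb, struct toy_frag *frag)
{
	assert(skb != NULL && frag != NULL);

	frag->len = skb->len;
	if (skb->head_frag) {
		frag->data = skb->head;		/* no copy needed */
		frag->copied = false;
		return 0;
	}

	frag->data = malloc(skb->len);		/* slow path: copy */
	if (!frag->data)
		return -1;
	memcpy(frag->data, skb->head, skb->len);
	frag->copied = true;
	return 0;
}
```

A real implementation would also have to take a reference on the
underlying page; the sketch leaves ownership out entirely.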

This means drivers currently using frags could be updated to avoid the
current skb->head allocation and reduce their memory footprint (aka skb
truesize); that's 512 or 1024 bytes saved per skb. This also makes
bpf/netfilter faster, since the 'first frag' will be part of the skb
linear part and no data needs to be copied.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30 21:35:11 -04:00
..
9p net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
802 net: Convert all sysctl registrations to register_net_sysctl 2012-04-20 21:22:30 -04:00
8021q vlan: Stop using NLA_PUT*(). 2012-04-02 04:33:44 -04:00
appletalk net: Convert all sysctl registrations to register_net_sysctl 2012-04-20 21:22:30 -04:00
atm net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
ax25 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-04-23 23:15:17 -04:00
batman-adv batman-adv: skip the window protection test when the originator has no neighbours 2012-04-18 09:54:02 +02:00
bluetooth Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth 2012-04-09 15:47:49 -04:00
bridge bridge: Fix fatal typo in setup of multicast_querier_expired 2012-04-30 13:30:56 -04:00
caif net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
can can: fix sparse warning for cgw_list 2012-04-16 21:08:18 +02:00
ceph crush: include header for global symbols 2012-04-27 00:03:34 -04:00
core net: allow skb->head to be a page fragment 2012-04-30 21:35:11 -04:00
dcb net: dcb: add CEE notify calls 2012-04-25 19:47:17 -04:00
dccp net: Convert all sysctl registrations to register_net_sysctl 2012-04-20 21:22:30 -04:00
decnet net decnet: Convert to use register_net_sysctl 2012-04-20 21:22:29 -04:00
dns_resolver net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
dsa
econet sock: Introduce named constants for sk_reuse 2012-04-21 15:52:25 -04:00
ethernet net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
ieee802154 6lowpan: duplicate definition of IEEE802154_ALEN 2012-04-26 06:01:09 -04:00
ipv4 ipv6: RTAX_FEATURE_ALLFRAG causes inefficient TCP segment sizing 2012-04-27 00:03:34 -04:00
ipv6 net/ipv6/udp: UDP encapsulation: introduce encap_rcv hook into IPv6 2012-04-28 22:21:51 -04:00
ipx net: Convert all sysctl registrations to register_net_sysctl 2012-04-20 21:22:30 -04:00
irda net: Convert all sysctl registrations to register_net_sysctl 2012-04-20 21:22:30 -04:00
iucv Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux 2012-03-22 18:15:32 -07:00
key net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
l2tp l2tp: Add missing net/net/ip6_checksum.h include. 2012-04-30 13:21:28 -04:00
lapb Remove all #inclusions of asm/system.h 2012-03-28 18:30:03 +01:00
llc net: add a limit parameter to sk_add_backlog() 2012-04-23 22:28:28 -04:00
mac80211 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem 2012-04-26 15:03:48 -04:00
netfilter sock: Introduce named constants for sk_reuse 2012-04-21 15:52:25 -04:00
netlabel netlabel: use GFP flags from caller instead of GFP_ATOMIC 2012-03-22 19:29:57 -04:00
netlink af_netlink: drop_monitor/dropwatch friendly 2012-04-24 00:35:14 -04:00
netrom net: Convert all sysctl registrations to register_net_sysctl 2012-04-20 21:22:30 -04:00
nfc Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem 2012-04-18 14:27:48 -04:00
openvswitch net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
packet af_packet: packet_getsockopt() cleanup 2012-04-21 16:36:42 -04:00
phonet net: Convert all sysctl registrations to register_net_sysctl 2012-04-20 21:22:30 -04:00
rds sock: Introduce named constants for sk_reuse 2012-04-21 15:52:25 -04:00
rfkill device.h: cleanup users outside of linux/include (C files) 2012-03-11 14:27:37 -04:00
rose net: Convert all sysctl registrations to register_net_sysctl 2012-04-20 21:22:30 -04:00
rxrpc net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
sched Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-04-23 23:15:17 -04:00
sctp net: add a limit parameter to sk_add_backlog() 2012-04-23 22:28:28 -04:00
sunrpc sock: Introduce named constants for sk_reuse 2012-04-21 15:52:25 -04:00
tipc tipc: remove inline instances from C source files. 2012-04-24 00:41:03 -04:00
unix net: sock_diag_handler structs can be const 2012-04-25 20:46:59 -04:00
wanrouter
wimax net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
wireless Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem 2012-04-26 15:03:48 -04:00
x25 net: add a limit parameter to sk_add_backlog() 2012-04-23 22:28:28 -04:00
xfrm net: Convert all sysctl registrations to register_net_sysctl 2012-04-20 21:22:30 -04:00
compat.c net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
Kconfig
Makefile
nonet.c
socket.c net: change big iov allocations 2012-04-21 16:24:20 -04:00
sysctl_net.c net: Remove register_net_sysctl_table 2012-04-20 21:22:30 -04:00