Commit Graph

Alexander Aring
cc84b3c6b4 ipv6: export several functions
This patch exports several neighbour discovery functions so that they
can be used by the 6LoWPAN neighbour discovery ops functionality.

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 20:41:23 -07:00
Alexander Aring
f997c55c1d ipv6: introduce neighbour discovery ops
This patch introduces a neighbour discovery ops callback structure. The
idea is to separate the 6LoWPAN handling into the 6lowpan module.

These callbacks offer 6LoWPAN different handling, such as 802.15.4 short
address handling or RFC 6775 (Neighbor Discovery Optimization for IPv6
over 6LoWPANs).
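
As an illustration, a simplified sketch of what such an ops structure
could look like (member names and signatures here are illustrative,
not the exact kernel API):

    struct ndisc_ops {
            int     (*parse_options)(const struct net_device *dev,
                                     struct nd_opt_hdr *nd_opt,
                                     struct ndisc_options *ndopts);
            void    (*update)(const struct net_device *dev,
                              struct neighbour *n, u32 flags,
                              u8 icmp6_type,
                              const struct ndisc_options *ndopts);
            int     (*opt_addr_space)(const struct net_device *dev,
                                      u8 icmp6_type);
            void    (*fill_addr_option)(const struct net_device *dev,
                                        struct sk_buff *skb,
                                        u8 icmp6_type);
    };

A 6LoWPAN device could then override these hooks to emit and parse
802.15.4 short-address options while other devices keep the defaults.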

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 20:41:23 -07:00
Alexander Aring
4f672235cb addrconf: put prefix address add in an own function
This patch moves the functionality to add an RA PIO prefix-generated
address into its own function. This move prepares for adding a hook to
add a second address for a second link-layer address, e.g. a short
address for 802.15.4 6LoWPAN.

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 20:41:23 -07:00
Alexander Aring
8ec5da4150 ndisc: add __ndisc_fill_addr_option function
This patch adds __ndisc_fill_addr_option as a low-level function for
ndisc_fill_addr_option which doesn't depend on the net_device parameter.

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 20:41:23 -07:00
Alexander Aring
4f36ce84c5 ndisc: add __ndisc_opt_addr_data function
This patch adds __ndisc_opt_addr_data as a low-level function for
ndisc_opt_addr_data which doesn't depend on the net_device parameter.

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 20:41:22 -07:00
Alexander Aring
1e82f961ac ndisc: add __ndisc_opt_addr_space function
This patch adds __ndisc_opt_addr_space as a low-level function for
ndisc_opt_addr_space which doesn't depend on the net_device parameter.
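
To illustrate the pattern (a sketch, assuming the existing
NDISC_OPT_SPACE() and ndisc_addr_option_pad() helpers; the exact kernel
code may differ slightly):

    static inline int __ndisc_opt_addr_space(unsigned char addr_len,
                                             int pad)
    {
            /* option space depends only on address length and padding */
            return NDISC_OPT_SPACE(addr_len + pad);
    }

    static inline int ndisc_opt_addr_space(struct net_device *dev)
    {
            /* the old API derives both values from the net_device */
            return __ndisc_opt_addr_space(dev->addr_len,
                                          ndisc_addr_option_pad(dev->type));
    }

This lets 6LoWPAN callers compute the option space for a short address
that doesn't match the net_device address length.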

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 20:41:22 -07:00
Alexander Aring
848484c931 6lowpan: remove ipv6 module request
Since we now use exported functions from the ipv6 kernel module, we no
longer need to request the module to get ipv6 functionality.

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 20:41:22 -07:00
Alexander Aring
2ad3ed5919 6lowpan: add 802.15.4 short addr slaac
This patch adds address autoconfiguration when a valid 802.15.4 short
address is available on 802.15.4 6LoWPAN interfaces.

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 20:41:22 -07:00
Alexander Aring
8626a0c83b 6lowpan: add private neighbour data
This patch introduces 6lowpan neighbour private data. As with the
interface private data, we handle private data for generic 6lowpan and
for link-layer specific 6lowpan.

The first use case is to save the short address for an 802.15.4
6lowpan neighbour.
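
A hedged sketch of the idea (struct and accessor names are
illustrative):

    struct lowpan_802154_neigh {
            __le16 short_addr;      /* 802.15.4 short address, if valid */
    };

    /* reached through the generic neighbour private area, which is
     * sized via dev->neigh_priv_len and returned by neighbour_priv() */
    static inline struct lowpan_802154_neigh *
    lowpan_802154_neigh(struct neighbour *n)
    {
            return neighbour_priv(n);
    }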

Cc: David S. Miller <davem@davemloft.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 20:41:22 -07:00
David S. Miller
6010097806 Merge branch 'cxgb4-sriov-sysfs'
Hariprasad Shenai says:

====================
Add SRIOV configuration via sysfs and few fixes

This series adds support to configure SR-IOV via the PCI sysfs interface
and reduces resource allocation in the kdump kernel by disabling offload.
It also synchronizes unicast and multicast MAC addresses, even when the
interface is in promiscuous mode.

This patch series has been created against net-next tree and includes
patches on cxgb4 and cxgb4vf driver.

We have included all the maintainers of the respective drivers. Kindly
review the changes and let us know in case of any review comments.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:46:05 -07:00
Hariprasad Shenai
d01f7abc91 cxgb4/cxgb4vf: Synchronize all MAC addresses
Synchronize MAC addresses even if the interface is in promiscuous
or allmulti mode.

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:46:05 -07:00
Hariprasad Shenai
b6244201f4 cxgb4: Enable SR-IOV configuration via PCI sysfs interface
Implement callback in the driver for the new PCI bus driver
interface that allows the user to enable/disable SR-IOV
virtual functions in a device via the sysfs interface.

Deprecate the module parameter previously used to configure SR-IOV.
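
A hedged sketch of the callback shape (the actual cxgb4 implementation
differs; this only shows how the PCI sriov_configure hook is wired up):

    static int cxgb4_sriov_configure(struct pci_dev *pdev, int num_vfs)
    {
            int ret;

            if (num_vfs == 0) {
                    pci_disable_sriov(pdev);
                    return 0;
            }
            /* return num_vfs on success, negative errno on failure */
            ret = pci_enable_sriov(pdev, num_vfs);
            return ret ? ret : num_vfs;
    }

    static struct pci_driver cxgb4_driver = {
            .name            = "cxgb4",
            .sriov_configure = cxgb4_sriov_configure,
            /* remaining fields omitted */
    };

Users then request VFs by writing the desired count to
/sys/bus/pci/devices/<BDF>/sriov_numvfs.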

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:46:04 -07:00
Hariprasad Shenai
c5a8c0f3aa cxgb4: Force cxgb4 driver as MASTER in kdump kernel
When is_kdump_kernel() is true, force the cxgb4 driver to be the master
so we can reinitialize the firmware/chip. Also reduce memory usage by
disabling offload.

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:46:04 -07:00
David S. Miller
88da48f497 Merge branch 'sched_skb_free_defer'
Eric Dumazet says:

====================
net_sched: defer skb freeing while changing qdiscs

qdiscs/classes are changed under RTNL protection and often
while blocking BH and root qdisc spinlock.

When lots of skbs need to be dropped, we free
them under these locks causing TX/RX freezes,
and more generally latency spikes.

I saw spikes of 50+ ms on quite fast hardware...

This patch series adds a simple queue protected by RTNL
where skbs can be placed until RTNL is released.

Note that this might also serve in the future for optional
reinjection of packets when a qdisc is replaced.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:08:36 -07:00
Eric Dumazet
fea024784f net_sched: sch_sfq: defer skb freeing
sfq_reset() can use rtnl_kfree_skbs() instead of kfree_skb()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:08:36 -07:00
Eric Dumazet
db4879d93c net_sched: sch_pie: defer skb freeing
pie_change() can use rtnl_qdisc_drop() to benefit from
deferred freeing.

pie_reset() is already using qdisc_reset_queue()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:08:36 -07:00
Eric Dumazet
2f08a9a162 net_sched: sch_netem: defer skb freeing
rtnl_kfree_skbs() can be used in tfifo_reset()

It would be nice if we could iterate through the rb tree instead
of removing one skb at a time, and build a single skb chain.
But this is left for a future patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:08:35 -07:00
Eric Dumazet
a5a9f5346f net_sched: sch_htb: defer skb freeing
Both htb_reset() and htb_destroy() can use __qdisc_reset_queue()
instead of __skb_queue_purge() to defer skb freeing of internal
queues.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:08:35 -07:00
Eric Dumazet
e7e424cdc4 net_sched: sch_hhf: defer skb freeing
Both hhf_reset() and hhf_change() can use rtnl_kfree_skbs()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:08:35 -07:00
Eric Dumazet
ece5d4c723 net_sched: fq_codel: defer skb freeing
Both fq_codel_change() and fq_codel_reset() can use rtnl_kfree_skbs()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:08:35 -07:00
Eric Dumazet
e14ffdfdd6 net_sched: sch_fq: defer skb freeing
Both fq_change() and fq_reset() can use rtnl_kfree_skbs()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:08:35 -07:00
Eric Dumazet
b3d7e2b29b net_sched: sch_codel: defer skb freeing in codel_change()
codel_change() can use rtnl_qdisc_drop()
to defer expensive skb freeing after locks are released.

codel_reset() already has support for deferred skb freeing
because it uses qdisc_reset_queue()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:08:35 -07:00
Eric Dumazet
f9aed311b6 net_sched: sch_choke: defer skb freeing
choke_reset() and choke_change() can use rtnl_qdisc_drop()
to defer expensive skb freeing after locks are released.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:08:34 -07:00
Eric Dumazet
1b5c5493e3 net_sched: add the ability to defer skb freeing
qdiscs are changed under RTNL protection and often
while blocking BH and root qdisc spinlock.

When lots of skbs need to be dropped, we free
them under these locks causing TX/RX freezes,
and more generally latency spikes.

This commit adds rtnl_kfree_skbs(), used to queue
skbs for deferred freeing.

Actual freeing happens right after RTNL is released,
with appropriate scheduling points.

rtnl_qdisc_drop() can also be used in place
of qdisc_drop() when RTNL is held.

qdisc_reset_queue() and __qdisc_reset_queue() get
the new behavior, so standard qdiscs like pfifo, pfifo_fast...
have their ->reset() method automatically handled.
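
A minimal sketch of the deferred-freeing pattern this enables
(illustrative, not the exact patch contents):

    static inline void example_reset_queue(struct sk_buff_head *list)
    {
            if (!skb_queue_empty(list)) {
                    /* hand the whole chain to the RTNL deferred-free
                     * list; the skbs are freed after RTNL is released */
                    rtnl_kfree_skbs(list->next, list->prev);
                    __skb_queue_head_init(list);
            }
    }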

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:08:34 -07:00
Jon Paul Maloy
35c55c9877 tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.

This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.
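
The arithmetic behind these numbers can be checked with a small
standalone program (an illustration of the scaling claim, not kernel
code):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
            const int sizes[] = { 16, 100, 400, 1000 };
            unsigned long i;

            for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
                    int n = sizes[i];
                    int local = (int)sqrt(n) - 1; /* local domain */
                    int heads = (int)sqrt(n);     /* remote heads, approx. */

                    printf("N=%4d: ~%2d monitored vs %3d full-mesh\n",
                           n, local + heads, n - 1);
            }
            return 0;
    }

For N = 400 this yields roughly 19 + 20 = 39 actively monitored
neighbors, matching the ~38 quoted above.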

This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:

- Each node makes up a linearly ascending, circular list of all its N
  known neighbors, based on their TIPC node identity. This algorithm
  must be the same on all nodes.

- The node then selects the next M = sqrt(N) - 1 nodes downstream from
  itself in the list, and chooses to actively monitor those. This is
  called its "local monitoring domain".

- It creates a domain record describing the monitoring domain, and
  piggy-backs this in the data area of all neighbor monitoring messages
  (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
  the cluster eventually (default within 400 ms) will learn about
  its monitoring domain.

- Whenever a node discovers a change in its local domain, e.g., a node
  has been added or has gone down, it creates and sends out a new
  version of its node record to inform all neighbors about the change.

- A node receiving a domain record from anybody outside its local domain
  matches this against its own list (which may not look the same), and
  chooses to not actively monitor those members of the received domain
  record that are also present in its own list. Instead, it relies on
  indications from the direct monitoring nodes if an indirectly
  monitored node has gone up or down. If a node is indicated lost, the
  receiving node temporarily activates its own direct monitoring towards
  that node in order to confirm, or not, that it is actually gone.

- Since each node is actively monitoring sqrt(N) downstream neighbors,
  each node is also actively monitored by the same number of upstream
  neighbors. This means that all non-direct monitoring nodes normally
  will receive sqrt(N) indications that a node is gone.

- A major drawback with ring monitoring is how it handles failures that
  cause massive network partitionings. If both a lost node and all its
  direct monitoring neighbors are inside the lost partition, the nodes in
  the remaining partition will never receive indications about the loss.
  To overcome this, each node also chooses to actively monitor some
  nodes outside its local domain. Those nodes are called remote domain
  "heads", and are selected in such a way that no node in the cluster
  will be more than two direct monitoring hops away. Because of this,
  each node, apart from monitoring the members of its local domain, will
  also typically monitor sqrt(N) remote head nodes.

- As an optimization, local list status, domain status and domain
  records are marked with a generation number. This saves senders from
  unnecessarily conveying unaltered domain records, and receivers from
  performing unneeded re-adaptations of their node monitoring list, such
  as re-assigning domain heads.

- As a measure of caution we have added the possibility to disable the
  new algorithm through configuration. We do this by keeping a threshold
  value for the cluster size; a cluster that grows beyond this value
  will switch from full-mesh to ring monitoring, and vice versa when
  it shrinks below the value. This means that if the threshold is set to
  a value larger than any anticipated cluster size (default size is 32)
  the new algorithm is effectively disabled. A patch set for altering the
  threshold value and for listing the table contents will follow shortly.

- This change is fully backwards compatible.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:06:28 -07:00
David Ahern
7889681f4a net: vrf: Update flags and features settings
1. Default VRF devices to not having a qdisc (IFF_NO_QUEUE). Users
   can add one as desired.

2. Disable adding a VLAN to a VRF device.

3. Enable offloads and hardware features similar to other logical
   devices (e.g., dummy, veth)

This change provides a significant boost in TCP stream Tx performance,
from ~2,700 Mbps to ~18,100 Mbps, and brings throughput close to the
performance without a VRF (18,500 Mbps), as measured with the netperf
TCP_STREAM benchmark using qemu with virtio+vhost for the NICs.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:03:48 -07:00
Paolo Abeni
df10db98ab tun: fix csum generation for tap devices
The commit 3416609363 ("tuntap: use common code for virtio_net_hdr
and skb GSO conversion") replaced the tun code for header manipulation
with the generic helpers. While doing so, it implicitly moved the
skb_partial_csum_set() invocation after eth_type_trans(), which
invalidates the current gso start/offset values.
Fix it by moving the helper invocation before the mac pulling.

Fixes: 3416609363 ("tuntap: use common code for virtio_net_hdr and skb GSO conversion")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:00:33 -07:00
David S. Miller
829e64d160 Merge branch 'skb_array'
Michael S. Tsirkin says:

====================
skb_array: array based FIFO for skbs

This is in response to the proposal by Jason to make the tun
rx packet queue lockless using a circular buffer.
My testing seems to show that at least for the common use case
in networking, which isn't lockless, a circular buffer
with indices does not perform that well, because
each index access causes a cache line to bounce between
CPUs, and index access causes stalls due to the dependency.

By comparison, an array of pointers where NULL means invalid
and !NULL means valid, can be updated without messing up barriers
at all and does not have this issue.

On the flip side, cache pressure may be caused by using large queues.
tun has a queue of 1000 entries by default and that's 8K.
At this point I'm not sure this can be solved efficiently.
The correct solution might be sizing the queues appropriately.

Here's an implementation of this idea: it can be used more
or less whenever sk_buff_head can be used, except you need
to know the queue size in advance.

As this might be useful outside of networking, I implemented
a generic array of void pointers, with a type-safe wrapper for skbs.

It remains to be seen whether resizing is required; in case it is,
I included patches implementing resizing by holding both the
consumer and the producer locks.

I think this code works fine without any extra memory barriers since we
always read and write the same location, so the accesses can not be
reordered.
Multiple writes of the same value into memory would mess things up
for us, I don't think compilers would do it though.
But if people feel it's better to be safe wrt compiler optimizations,
specifying queue as volatile would probably do it in a cleaner way
than converting all accesses to READ_ONCE/WRITE_ONCE. Thoughts?

The only issue is with calls within a loop using the __ptr_ring_XXX
accessors - in theory compiler could hoist accesses out of the loop.

Following volatile-considered-harmful.txt I merely
documented that callers that busy-poll should invoke cpu_relax().
Most people will use the external skb_array_XXX APIs with a spinlock,
so this should not be an issue for them.

Eric Dumazet suggested adding an extra pointer to skb for when
we have a single outstanding packet. I could not figure out
a way to implement this without a shared consumer/producer lock
though, which would cause cache line bounces by itself.

Jesper, Jason, I know that both of you tested this,
please post Tested-by tags for whatever was tested.

changes since v7
	fix typos noticed by Jesper Brouer

changes since v6
	resize implemented. peek/full calls are no longer lockless

	replaced _FIELD macros with _CALL which invoke a function
	on the pointer rather than just returning a value

	destroy now scans the array and frees all queued skbs

changes since v5
	implemented a generic ptr_ring api, and
		made skb_array a type-safe wrapper
	apis for taking the spinlock in different contexts
		following expected usecase in tun
changes since v4 (v3 was never posted)
	documentation
	dropped SKB_ARRAY_MIN_SIZE heuristic
	unit test (in userspace, included as patch 2)

changes since v2:
        fixed integer overflow pointed out by Eric.
        added some comments.

changes since v1:
        fixed bug pointed out by Eric.
====================

Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 13:58:34 -07:00
Michael S. Tsirkin
7d7072e3ba skb_array: resize support
Update skb_array after ptr_ring API changes.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 13:58:27 -07:00
Michael S. Tsirkin
5d49de5320 ptr_ring: resize support
This adds ring resize support. Seems to be necessary as
users such as tun allow userspace control over queue size.

If resize is used, this costs us the ability to peek at the queue without
the consumer lock - should not be a big deal, as peek and consume are
usually run on the same CPU.

If the ring is made bigger, the ring contents are preserved.  If the ring
is made smaller, extra pointers are passed to an optional destructor
callback.

The cleanup function also gains a destructor callback so that
all pointers in the queue can be cleaned up.

This changes some APIs but we don't have any users yet,
so it won't break bisect.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 13:58:27 -07:00
Michael S. Tsirkin
ad69f35d1d skb_array: array based FIFO for skbs
A simple array based FIFO of pointers.  Intended for the net stack, so it
uses skbs for type safety. Implemented as a set of wrappers around ptr_ring.
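
A hedged sketch of the wrapper idea (simplified; the locking variants
for different calling contexts are omitted):

    struct skb_array {
            struct ptr_ring ring;
    };

    static inline int skb_array_produce(struct skb_array *a,
                                        struct sk_buff *skb)
    {
            return ptr_ring_produce(&a->ring, skb);
    }

    static inline struct sk_buff *skb_array_consume(struct skb_array *a)
    {
            return ptr_ring_consume(&a->ring);
    }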

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 13:58:27 -07:00
Michael S. Tsirkin
9fb6bc5b4a ptr_ring: ring test
Add a ringtest-based unit test for ptr_ring.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 13:58:27 -07:00
Michael S. Tsirkin
2e0ab8ca83 ptr_ring: array based FIFO for pointers
A simple array based FIFO of pointers.  Intended for the net stack, which
commonly has a single consumer/producer.
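
A minimal userspace-style sketch of the idea (names illustrative, not
the kernel API): NULL marks a free slot, so the producer and the
consumer each only touch the slot they own.

    struct ring {
            void **queue;           /* size entries, NULL == empty */
            int size;
            int producer;
            int consumer;
    };

    static int ring_produce(struct ring *r, void *ptr)
    {
            if (r->queue[r->producer])      /* slot occupied: full */
                    return -1;
            r->queue[r->producer] = ptr;
            if (++r->producer >= r->size)
                    r->producer = 0;
            return 0;
    }

    static void *ring_consume(struct ring *r)
    {
            void *ptr = r->queue[r->consumer];

            if (ptr) {                      /* non-NULL: valid entry */
                    r->queue[r->consumer] = NULL;
                    if (++r->consumer >= r->size)
                            r->consumer = 0;
            }
            return ptr;
    }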

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 13:57:21 -07:00
WANG Cong
b2313077ed net_sched: make tcf_hash_check() boolean
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:43:35 -07:00
David S. Miller
a6e225cad3 Merge branch 'vrf-ipv6-mcast-link-local'
David Ahern says:

====================
net: vrf: Handle ipv6 multicast and link-local addresses

IPv6 multicast and link-local addresses require special handling by the
VRF driver. Rather than using the VRF device index and full FIB lookups,
packets to/from these addresses should use direct FIB lookups based on
the VRF device table.

Multicast routes do not make sense for the L3 master device directly.
Accordingly, do not add mcast routes for the device, and the VRF driver
should fail attempts to send packets to ipv6 mcast addresses on the
device (e.g., ping6 ff02::1%<vrf> should fail)

With this change connections into and out of a VRF enslaved device work
for multicast and link-local addresses (icmp, tcp, and udp).  e.g.,

1. packets into VM with VRF config:
    ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1
    ping6 -c3 ff02::1%br1
    ssh -6 fe80::e0:f9ff:fe1c:b974%br1

2. packets going out a VRF enslaved device:
    ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1
    ping6 -c3 ff02::1%eth1
    ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:34:34 -07:00
David Ahern
9ff7438460 net: vrf: Handle ipv6 multicast and link-local addresses
IPv6 multicast and link-local addresses require special handling by the
VRF driver:
1. Rather than using the VRF device index and full FIB lookups,
   packets to/from these addresses should use direct FIB lookups based on
   the VRF device table.

2. fail sends/receives on a VRF device to/from a multicast address
   (e.g., make ping6 ff02::1%<vrf> fail)

3. move the setting of the flow oif to the first dst lookup and revert
   the change in icmpv6_echo_reply made in ca254490c8 ("net: Add VRF
   support to IPv6 stack"). Linklocal/mcast addresses require use of the
   skb->dev.

With this change connections into and out of a VRF enslaved device work
for multicast and link-local addresses (icmp, tcp, and udp),
e.g.,

1. packets into VM with VRF config:
    ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1
    ping6 -c3 ff02::1%br1

    ssh -6 fe80::e0:f9ff:fe1c:b974%br1

2. packets going out a VRF enslaved device:
    ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1
    ping6 -c3 ff02::1%eth1
    ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:34:34 -07:00
David Ahern
ba46ee4c0e net: ipv6: Do not add multicast route for l3 master devices
L3 master devices are virtual devices similar to the loopback
device. Link local and multicast routes for these devices do
not make sense. The ipv6 addrconf code already skips adding a
linklocal address; do the same for the mcast route.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:34:34 -07:00
David Ahern
cd2a9e62c8 net: l3mdev: Remove const from flowi6 arg to get_rt6_dst
Allow drivers to pass the flow arg to functions where the arg is not
const, allowing the driver to make updates as needed (e.g., setting oif).

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:34:34 -07:00
David S. Miller
c9ad5a6568 Merge branch 'af_iucv-big-bufs'
Ursula Braun says:

====================
s390: af_iucv patches

Here are improvements for af_iucv that relax the pressure to allocate
big contiguous kernel buffers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:21:05 -07:00
Eugene Crosser
a006353a9a af_iucv: use paged SKBs for big inbound messages
When an inbound message is bigger than a page, allocate a paged SKB,
and subsequently use the IUCV receive primitive with the IPBUFLST flag.
This relaxes the pressure to allocate big contiguous kernel buffers.
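
A hedged sketch of the allocation step (parameter choices are
illustrative, not the exact af_iucv code):

    static struct sk_buff *example_alloc_paged_skb(size_t msg_len)
    {
            int err;

            /* no linear headroom requested; the message body lands
             * in page fragments instead of one large linear buffer */
            return alloc_skb_with_frags(0, msg_len,
                                        PAGE_ALLOC_COSTLY_ORDER,
                                        &err, GFP_KERNEL);
    }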

Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:21:05 -07:00
Eugene Crosser
291759a575 af_iucv: remove fragment_skb() to use paged SKBs
Before introducing paged skbs in the receive path, get rid of the
function `iucv_fragment_skb()` that replaces one large linear skb
with several smaller linear skbs.

Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:21:04 -07:00
Eugene Crosser
e53743994e af_iucv: use paged SKBs for big outbound messages
When an outbound message is bigger than a page, allocate and fill
a paged SKB, and subsequently use the IUCV send primitive with the
IPBUFLST flag. This relaxes the pressure to allocate big contiguous
kernel buffers.

Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:21:04 -07:00
Alexander Shiyan
818d49ad16 dt: bindings: Add bindings for Cirrus Logic CS89x0 ethernet chip
Add device tree binding documentation details for Cirrus Logic
CS8900/CS8920 ethernet chip.

Signed-off-by: Alexander Shiyan <shc_work@mail.ru>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:17:57 -07:00
Alexander Shiyan
d3cf8fd3fc net: cx89x0: Add DT support
Add DT support to the Cirrus Logic CS89x0 driver.

Signed-off-by: Alexander Shiyan <shc_work@mail.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:17:57 -07:00
WANG Cong
d9fa17ef9f act_police: rename tcf_act_police_locate() to tcf_act_police_init()
This function is just ->init(), rename it to make it obvious.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 00:05:57 -07:00
WANG Cong
95df1b1607 net_sched: remove internal use of TC_POLICE_*
These should have been removed when we removed CONFIG_NET_CLS_POLICE.
We cannot remove them entirely since they are exposed
to userspace.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 00:05:57 -07:00
David S. Miller
161cd45ff0 Merge branch 'rds-mprds-foundations'
Sowmini Varadhan says:

====================
RDS: multiple connection paths for scaling

Today RDS-over-TCP is implemented by demux-ing multiple PF_RDS sockets
between any 2 endpoints (where endpoint == [IP address, port]) over a
single TCP socket between the 2 IP addresses involved. This has the
limitation that it ends up funneling multiple RDS flows over a single
TCP flow, thus the rds/tcp connection is
   (a) upper-bounded to the single-flow bandwidth,
   (b) suffers from head-of-line blocking for the RDS sockets.

Better throughput (for a fixed small packet size, MTU) can be achieved
by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
RDS (mprds).  Each such TCP/IP flow constitutes a path for the rds/tcp
connection. RDS sockets will be attached to a path based on some hash
(e.g., of local address and RDS port number) and packets for that RDS
socket will be sent over the attached path using TCP to segment/reassemble
RDS datagrams on that path.

The table below, generated using a prototype that implements mprds,
shows that this is significant for scaling to 40G.  Packet sizes
used were: 8K byte req, 256 byte resp. MTU: 1500.  The parameters for
RDS-concurrency used below are described in the rds-stress(1) man page-
the number listed is proportional to the number of threads at which max
throughput was attained.

  -------------------------------------------------------------------
     RDS-concurrency   Num of       tx+rx K/s (iops)       throughput
     (-t N -d N)       TCP paths
  -------------------------------------------------------------------
        16             1             600K -  700K            4 Gbps
        28             8            5000K - 6000K           32 Gbps
  -------------------------------------------------------------------

FAQ: what is the relation between mprds and mptcp?
  mprds is orthogonal to mptcp. Whereas mptcp creates
  sub-flows for a single TCP connection, mprds parallelizes tx/rx
  at the RDS layer. MPRDS with N paths will allow N datagrams to
  be sent in parallel; each path will continue to send one
  datagram at a time, with sender and receiver keeping track of
  the retransmit and dgram-assembly state based on the RDS header.
  If desired, mptcp can additionally be used to speed up each TCP
  path. That acceleration is orthogonal to the parallelization benefits
  of mprds.

This patch series lays down the foundational data-structures to support
mprds in the kernel. It implements the changes to split up the
rds_connection structure into a common (to all paths) part,
and a per-path rds_conn_path. All I/O workqs are driven from
the rds_conn_path.

Note that this patchset does not (yet) actually enable multipathing
for any of the transports; all transports will continue to use a
single path with the refactored data-structures. A subsequent patchset
will  add the changes to the rds-tcp module to actually use mprds
in rds-tcp.
====================
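
A hedged sketch of the split described above (field names are
illustrative, not the exact RDS structures):

    struct rds_conn_path {
            struct rds_connection   *cp_conn;  /* back-pointer */
            struct delayed_work     cp_send_w; /* per-path I/O workqs */
            struct delayed_work     cp_recv_w;
            atomic_t                cp_state;
    };

    struct rds_connection {
            __be32                  c_laddr;   /* common to all paths */
            __be32                  c_faddr;
            struct rds_conn_path    c_path[RDS_MPATH_WORKERS];
    };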

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:44 -07:00
Sowmini Varadhan
3ecc5693c0 RDS: Update rds_conn_destroy to be MP capable
Refactor rds_conn_destroy() so that the per-path dismantling
is done in rds_conn_path_destroy(), and then iterate over the
connection's paths, calling rds_conn_path_destroy() as needed.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:44 -07:00
Sowmini Varadhan
d769ef81d5 RDS: Update rds_conn_shutdown to work with rds_conn_path
This commit changes rds_conn_shutdown to take a rds_conn_path *
argument, allowing it to shutdown paths other than c_path[0] for
MP-capable transports.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:44 -07:00
Sowmini Varadhan
1c5113cf79 RDS: Initialize all RDS_MPATH_WORKERS in __rds_conn_create
Add a for() loop in __rds_conn_create to initialize all the
conn_paths, in preparation for MP-capable transports.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:44 -07:00