linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-02 17:11:33 +00:00

Author	SHA1	Message	Date
Maor Gottlieb	1cabe6b096	net/mlx5e: Create aRFS flow tables Create the following four flow tables for aRFS usage: 1. IPv4 TCP - filtering 4-tuple of IPv4 TCP packets. 2. IPv6 TCP - filtering 4-tuple of IPv6 TCP packets. 3. IPv4 UDP - filtering 4-tuple of IPv4 UDP packets. 4. IPv6 UDP - filtering 4-tuple of IPv6 UDP packets. Each flow table has two flow groups: one for the 4-tuple filtering (full match) and the other contains * rule for miss rule. Full match rule means a hit for aRFS and packet will be forwarded to the dedicated RQ/Core, miss rule packets will be forwarded to default RSS hashing. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 16:29:11 -04:00
Maor Gottlieb	5a7b27eb9c	net/mlx5: Initializing CPU reverse mapping Allocating CPU rmap and add entry for each IRQ. CPU rmap is used in aRFS to get the RX queue number of the RX completion interrupts. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 16:29:11 -04:00
Maor Gottlieb	33cfaaa8f3	net/mlx5e: Split the main flow steering table Currently, the main flow table is used for two purposes: One is to do mac filtering and the other is to classify the packet l3-l4 header in order to steer the packet to the right RSS TIR. This design is very complex, for each configured mac address we have to add eleven rules (rule for each traffic type), the same if the device is put to promiscuous/allmulti mode. This scheme isn't scalable for future features like aRFS. In order to simplify it, the main flow table is split to two flow tables: 1. l2 table - filter the packet dmac address, if there is a match we forward to the ttc flow table. 2. TTC (Traffic Type Classifier) table - classify the traffic type of the packet and steer the packet to the right TIR. In this new design, when new mac address is added, the driver adds only one flow rule instead of eleven. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 16:29:11 -04:00
Maor Gottlieb	acff797cd1	net/mlx5e: Refactor mlx5e flow steering structs Slightly refactor and re-order the flow steering structs, tables and data-bases for better self-containment and flexibility to add more future steering phases (tables/rules/data bases) e.g: aRFS. Changes: 1. Move the vlan DB and address DB into their table structs. 2. Rename steering table structs to unique format: mlx5e_*_table, e.g: mlx5e_vlan_table. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 16:29:10 -04:00
Maor Gottlieb	13de6c106c	net/mlx5: Support different attributes for priorities in namespace Currently, namespace could be initialized only with priorities with the same attributes. Add support to initialize namespace with priorities with different attributes(e.g. different number of levels). Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 16:29:10 -04:00
Maor Gottlieb	d63cd28608	net/mlx5: Add user chosen levels when allocating flow tables Currently, consumers of the flow steering infrastructure can't choose their own flow table levels and are limited to one flow table per level. This just wastes levels. Instead, we introduce here the possibility to use multiple flow tables in a level. The user is free to connect these flow tables, while following the rule (FTEs in FT of level x could only point to FTs of level y where y > x). In addition, this patch switches the order of the create/destroy flow tables of the NIC (vlan and main). Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 16:29:09 -04:00
Maor Gottlieb	a257b94a18	net/mlx5: Set number of allowed levels in priority Refactors the flow steering namespace creation, by changing the name num_fts to num_levels. When a new flow table is created, the driver assigns a new level to this flow table, so the meaning is equivalent. Since downstream patches will introduce the ability to create more than one flow table per level, the name num_fts is no longer accurate. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 16:29:09 -04:00
Maor Gottlieb	d745098ced	net/mlx5: Introduce modify flow rule destination This API is used for modifying the flow rule destination. This is needed for modifying the pointed flow table by the traffic type classifier rules to point on the aRFS tables. Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 16:29:08 -04:00
Tariq Toukan	1da366964e	net/mlx5e: Direct TIR per RQ Introduce new TIRs for direct access per RQ. Now we have 2 available kinds of TIRs: - indirect TIR per traffic type, each points to one RQT (RSS RQT) same as before. - New direct TIR per RQ, each points to RQT with a size of one that forwards packets to that RQ only. Driver will open max channels (num cores) direct TIRs by default, they will be filled with the actual RQs once channels are allocated. Needed for downstream aRFS and ethtool direct steering functionalities. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 16:29:08 -04:00
Matthew Finlay	01a14098d3	net/mlx5e: Call vxlan_get_rx_port() with rtnl lock Hold the rtnl lock when calling vxlan_get_rx_port(). Fixes: `b7aade1548` ("vxlan: break dependency with netdev drivers") Signed-off-by: Matthew Finlay <matt@mellanox.com> Reported-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 16:29:08 -04:00
David S. Miller	4b2523c180	Merge branch 'enc28j60-small-improvements' Michael Heimpold says: ==================== net: ethernet: enc28j60: small improvements This series of two patches adds the following improvements to the driver: 1) Rework the central SPI read function so that it is compatible with SPI masters which only support half duplex transfers. 2) Add a device tree binding for the driver. Changelog: v3: * renamed and improved binding documentation as suggested by Rob Herring v2: * took care of Arnd Bergmann's review comments - allow to specify MAC address via DT - unconditionally define DT id table * increased the driver version minor number * driver author's email address bounces, removed from address list v1: * Initial submission ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 16:23:03 -04:00
Michael Heimpold	2dd355a007	net: ethernet: enc28j60: add device tree support The following patch adds the required match table for device tree support (and while at, fix the indent). It's also possible to specify the MAC address in the DT blob. Also add the corresponding binding documentation file. Signed-off-by: Michael Heimpold <mhei@heimpold.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 16:23:02 -04:00
Michael Heimpold	2957a28a0e	net: ethernet: enc28j60: support half-duplex SPI controllers The current spi_read_buf function fails on SPI host masters which are only half-duplex capable. Splitting the Tx and Rx part solves this issue. Tested on Raspberry Pi (full duplex) and I2SE Duckbill (half duplex). Signed-off-by: Michael Heimpold <mhei@heimpold.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 16:23:02 -04:00
Nikolay Aleksandrov	f4b05d27ec	net: constify is_skb_forwardable's arguments is_skb_forwardable is not supposed to change anything so constify its arguments Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 16:13:36 -04:00
David S. Miller	92aff96ac3	Merge branch 'ppp-rtnetlink' Guillaume Nault says: ==================== ppp: add rtnetlink support PPP devices lack the ability to be customised at creation time. In particular they can't be created in a given netns or with a particular name. Moving or renaming the device after creation is possible, but creates undesirable transient effects on servers where PPP devices are constantly created and removed, as users connect and disconnect. Implementing rtnetlink support solves this problem. The rtnetlink handlers implemented in this series are minimal, and can only replace the PPPIOCNEWUNIT ioctl. The rest of PPP ioctls remains necessary for any other operation on channels and units. It is perfectly possible to mix PPP devices created by rtnl and by ioctl(PPPIOCNEWUNIT). Devices will behave in the same way. mutex_trylock() is used to resolve the locking issue wrt. locking dependency between rtnl_lock() and ppp_mutex (see ppp_nl_newlink() in patch #2). A user visible difference brought by this series is that old PPP interfaces (those created with ioctl(PPPIOCNEWUNIT)), can now be removed by "ip link del", just like new rtnl based PPP devices. Changes since v3: - Rebase on net-next. - Not an RFC anymore. Changes since v2: - Define ->rtnl_link_ops for ioctl based PPP devices, so they can handle rtnl messages just like rtnl based ones (suggested by Stephen Hemminger). - Move back to original lock ordering between ppp_mutex and rtnl_lock to simplify patch series. Handle lock inversion issue using mutex_trylock() (suggested by Stephen Hemminger). - Do file descriptor lookup directly in ppp_nl_newlink(), to simplify ppp_dev_configure(). Changes since v1: - Rebase on net-next. - Invert locking order wrt. ppp_mutex and rtnl_lock and protect file->private_data with ppp_mutex. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 16:09:45 -04:00
Guillaume Nault	96d934c70d	ppp: add rtnetlink device creation support Define PPP device handler for use with rtnetlink. The only PPP specific attribute is IFLA_PPP_DEV_FD. It is mandatory and contains the file descriptor of the associated /dev/ppp instance (the file descriptor which would have been used for ioctl(PPPIOCNEWUNIT) in the ioctl-based API). The PPP device is removed when this file descriptor is released (same behaviour as with ioctl based PPP devices). PPP devices created with the rtnetlink API behave like the ones created with ioctl(PPPIOCNEWUNIT). In particular existing ioctls work the same way, no matter how the PPP device was created. The rtnl callbacks are also assigned to ioctl based PPP devices. This way, rtnl messages have the same effect on any PPP devices. The immediate effect is that all PPP devices, even ioctl-based ones, can now be removed with "ip link del". A minor difference still exists between ioctl and rtnl based PPP interfaces: in the device name, the number following the "ppp" prefix corresponds to the PPP unit number for ioctl based devices, while it is just an unrelated incrementing index for rtnl ones. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 16:09:44 -04:00
Guillaume Nault	7d9f0b4874	ppp: define reusable device creation functions Move PPP device initialisation and registration out of ppp_create_interface(). This prepares code for device registration with rtnetlink. While there, simplify the prototype of ppp_create_interface(): * Since ppp_dev_configure() takes care of setting file->private_data, there's no need to return a ppp structure to ppp_unattached_ioctl() anymore. * The unit parameter is made read/write so that ppp_create_interface() can tell which unit number has been assigned. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 16:09:44 -04:00
Alexandre TORGUE	ac1f74a7fc	net: ethernet: stmmac: update MDIO support for GMAC4 On new GMAC4 IP, MAC_MDIO_address register has been updated, and bitmaps changed. This patch takes into account those changes. Signed-off-by: Alexandre TORGUE <alexandre.torgue@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 15:14:24 -04:00
Jiri Benc	65226ef8ea	vxlan: fix initialization with custom link parameters Commit `0c867c9bf8` ("vxlan: move Ethernet initialization to a separate function") changed initialization order and as an unintended result, when the user specifies additional link parameters (such as IFLA_ADDRESS) while creating vxlan interface, those are overwritten by vxlan_ether_setup later. It's necessary to call ether_setup from within the ->setup callback. That way, the correct parameters are set by rtnl_create_link later. This is done also for VXLAN-GPE, as we don't know the interface type yet at that point, and changed to the correct interface type later. Fixes: `0c867c9bf8` ("vxlan: move Ethernet initialization to a separate function") Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Jiri Benc <jbenc@redhat.com> Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 15:08:56 -04:00
David S. Miller	638af17873	Merge branch 'samples-bpf-user-experience' Jesper Dangaard Brouer says: ==================== samples/bpf: Improve user experience It is a steep learning curve getting started with using the eBPF examples in samples/bpf/. There are several dependencies, and specific versions of these dependencies. Invoking make in the correct manner is also slightly obscure. This patchset cleans up, documents and hopefully improves the first time user experience with the eBPF samples directory by auto-detecting certain scenarios. V4: - Address Naveen's nitpicks - Handle/fail if extra args are passed in LLC or CLANG (David Laight) V3: - Add Alexei's ACKs - Remove README paragraph about LLVM experimental BPF target as it only existed between LLVM version 3.6 to 3.7. V2: - Adjusted recommend minimum versions to 3.7.1 - Included clang build instructions - New patch adding CLANG variable and validation of command ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 14:26:32 -04:00
Jesper Dangaard Brouer	bdefbbf2ec	samples/bpf: like LLC also verify and allow redefining CLANG command Users are likely to manually compile both LLVM 'llc' and 'clang' tools. Thus, also allow redefining CLANG and verify that the command exists. Makefile implementation wise, the target that verifies the command has been generalized. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 14:26:08 -04:00
Jesper Dangaard Brouer	b62a796c10	samples/bpf: allow make to be run from samples/bpf/ directory It is not intuitive that 'make' must be run from the top level directory with argument "samples/bpf/" to compile these eBPF samples. Introduce a kbuild make file trick that allows make to be run from the "samples/bpf/" directory itself. It basically changes to the top level directory and calls "make samples/bpf/" with the "/" slash after the directory name. Also add a clean target that only cleans this directory, by taking advantage of the kbuild external module setting M=$PWD. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 14:25:33 -04:00
Jesper Dangaard Brouer	1c97566d51	samples/bpf: add a README file to get users started Getting started with using examples in samples/bpf/ is not straightforward. There are several dependencies, and specific versions of these dependencies. Just compiling the example tool is also slightly obscure, e.g. one needs to call make like: make samples/bpf/ Do notice the "/" slash after the directory name. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 14:25:32 -04:00
Jesper Dangaard Brouer	7b01dd5793	samples/bpf: Makefile verify LLVM compiler avail and bpf target is supported Make compiling samples/bpf more user friendly, by detecting if LLVM compiler tool 'llc' is available, and also detect if the 'bpf' target is available in this version of LLVM. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 14:25:32 -04:00
Jesper Dangaard Brouer	6ccfba75d3	samples/bpf: add back functionality to redefine LLC command It is practical to be able to redefine the location of the LLVM command 'llc', because not all distros have a LLVM version with bpf target support. Thus, it is sometimes required to compile LLVM from source, and sometimes it is not desired to overwrite the distros default LLVM version. This feature was removed with `128d1514be` ("samples/bpf: Use llc in PATH, rather than a hardcoded value"). Add this feature back. Note that it is possible to redefine the LLC on the make command like: make samples/bpf/ LLC=~/git/llvm/build/bin/llc Fixes: `128d1514be` ("samples/bpf: Use llc in PATH, rather than a hardcoded value") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 14:25:32 -04:00
David S. Miller	c23846c143	Merge branch 'cxgb4-mbox-cmd-logging' Hariprasad Shenai says: ==================== cxgb4/cxgb4vf: add support for mbox cmd logging This patch series adds support for logging mailbox commands and replies for debugging purpose for both PF and VF driver. This patch series has been created against net-next tree and includes patches on cxgb4 and cxgb4vf driver. We have included all the maintainers of respective drivers. Kindly review the change and let us know in case of any review comments. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 13:41:47 -04:00
Hariprasad Shenai	ae7b757622	cxgb4vf: Add support to enable logging of firmware mailbox commands for VF Add new /sys/kernel/debug/ support to dump firmware mailbox commands and replies for debugging purpose. Based on original work by Casey Leedom <leedom@chelsio.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 13:41:46 -04:00
Hariprasad Shenai	7f080c3f2f	cxgb4: Add support to enable logging of firmware mailbox commands Add new /sys/kernel/debug/ support to dump a firmware mailbox command issued and replies for debugging purpose. Based on original work by Casey Leedom <leedom@chelsio.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 13:41:46 -04:00
David S. Miller	482f13aa5e	Merge branch 'hns-props' Yisen Zhuang says: ==================== net: hns: update DT properties according to Rob's comments There are some inappropriate properties definition in hns DT. We update the definition according to Rob's review comments and fix some typos in binding. For more details, please see individual patches. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 13:39:04 -04:00
Yisen Zhuang	ea991027ef	dts: hisi: update hns dst for changing property port-id to reg Indexes should generally be avoided. This patch changes property port-id to reg in dsaf port node. Signed-off-by: Yisen Zhuang <yisen.zhuang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 13:39:04 -04:00
Yisen Zhuang	a1ecde2c6f	Documentation: Bindings: Update DT binding for hns dsaf node This patch changes property port-id to reg in dsaf port node, removes property cpld-ctrl-reg, and fixes some typos. Signed-off-by: Yisen Zhuang <yisen.zhuang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 13:39:03 -04:00
Yisen Zhuang	0211b8fb5d	net: hns: change port-id property to reg property in dsaf port node Indexes should generally be avoided. So we use reg rather than port-id to index ports. Signed-off-by: Yisen Zhuang <yisen.zhuang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 13:39:03 -04:00
Yisen Zhuang	1ffdfac99f	net: hns: remove cpld-ctrl-reg and add cell in the cpld-syscon property Because cpld-ctrl-reg property is offset base on cpld-syscon property, we make it as a cell in the cpld-syscon property. Signed-off-by: Yisen Zhuang <yisen.zhuang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-29 13:39:02 -04:00
Mahesh Bandewar	494e8489db	ipvlan: Fix failure path in dev registration during link creation When newlink creation fails at device-registration, the port->count is decremented twice. Francesco Ruggeri (fruggeri@arista.com) found this issue in Macvlan and the same exists in IPvlan driver too. While fixing this issue I noticed another issue of missing unregister in case of failure, so adding it to the fix which is similar to the macvlan fix by Francesco in commit `3083796075` ("macvlan: fix failure during registration v3") Reported-by: Francesco Ruggeri <fruggeri@arista.com> Signed-off-by: Mahesh Bandewar <maheshb@google.com> CC: Eric Dumazet <edumazet@google.com> CC: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-28 17:23:08 -04:00
françois romieu	222e4d0b13	pch_gbe: replace private tx ring lock with common netif_tx_lock pch_gbe_tx_ring.tx_lock is only used in the hard_xmit handler and in the transmit completion reaper called from NAPI context. Compile-tested only. Potential victims Cced. Someone more knowledgeable may check if pch_gbe_tx_queue could have some use for a mmiowb. Signed-off-by: Francois Romieu <romieu@fr.zoreil.com> Cc: Darren Hart <dvhart@infradead.org> Cc: Andy Cress <andy.cress@us.kontron.com> Cc: bryan@fossetcon.org Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-28 17:19:58 -04:00
Florian Fainelli	badf3ada60	net: dsa: Provide CPU port statistics to master netdev This patch overloads the DSA master netdev, aka CPU Ethernet MAC to also include switch-side statistics, which is useful for debugging purposes, when the switch is not properly connected to the Ethernet MAC (duplex mismatch, (RG)MII electrical issues etc.). We accomplish this by retaining the original copy of the master netdev's ethtool_ops, and just overload the 3 operations we care about: get_sset_count, get_strings and get_ethtool_stats so as to intercept these calls and call into the original master_netdev ethtool_ops, plus our own. We take this approach as opposed to providing a set of DSA helper functions that would retrieve the CPU port's statistics, because the entire purpose of DSA is to allow unmodified Ethernet MAC drivers to be used as CPU conduit interfaces, therefore, statistics overlay in such drivers would simply not scale. The new ethtool -S <iface> output would therefore look like this now: <iface> statistics p<2 digits cpu port number>_<switch MIB counter names> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-28 17:16:17 -04:00
Eric Dumazet	0cef6a4c34	tcp: give prequeue mode some care TCP prequeue goal is to defer processing of incoming packets to user space thread currently blocked in a recvmsg() system call. Intent is to spend less time processing these packets on behalf of softirq handler, as softirq handler is unfair to normal process scheduler decisions, as it might interrupt threads that do not even use networking. Current prequeue implementation has following issues : 1) It only checks size of the prequeue against sk_rcvbuf It was fine 15 years ago when sk_rcvbuf was in the 64KB vicinity. But we now have ~8MB values to cope with modern networking needs. We have to add sk_rmem_alloc in the equation, since out of order packets can definitely use up to sk_rcvbuf memory themselves. 2) Even with a fixed memory truesize check, prequeue can be filled by thousands of packets. When prequeue needs to be flushed, either from softirq context (in tcp_prequeue() or timer code), or process context (in tcp_prequeue_process()), this adds a latency spike which is often not desirable. I added a fixed limit of 32 packets, as this translated to a max flush time of 60 us on my test hosts. Also note that all packets in prequeue are not accounted for tcp_mem, since they are not charged against sk_forward_alloc at this point. This is probably not a big deal. Note that this might increase LINUX_MIB_TCPPREQUEUEDROPPED counts, which is misnamed, as packets are not dropped at all, but rather pushed to the stack (where they can be either consumed or dropped) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-28 17:14:35 -04:00
Michal Kazior	b43e7199a9	fq: split out backlog update logic mac80211 (which will be the first user of the fq.h) recently started to support software A-MSDU aggregation. It glues skbuffs together into a single one so the backlog accounting needs to be more fine-grained. To avoid backlog sorting logic duplication split it up for re-use. Signed-off-by: Michal Kazior <michal.kazior@tieto.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-28 17:03:38 -04:00
Dan Carpenter	b43586576e	tipc: remove an unnecessary NULL check This is never called with a NULL "buf" and anyway, we dereference 's' on the lines before so it would Oops before we reach the check. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-28 16:54:12 -04:00
Arnd Bergmann	6b87663fbe	net/mlx5e: avoid stack overflow in mlx5e_open_channels struct mlx5e_channel_param is a large structure that is allocated on the stack of mlx5e_open_channels, and with a recent change it has grown beyond the warning size for the maximum stack that a single function should use: mellanox/mlx5/core/en_main.c: In function 'mlx5e_open_channels': mellanox/mlx5/core/en_main.c:1325:1: error: the frame size of 1072 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] The function is already using dynamic allocation and is not in a fast path, so the easiest workaround is to use another kzalloc for allocating the channel parameters. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `d3c9bc2743` ("net/mlx5e: Added ICO SQs") Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-28 16:46:59 -04:00
Jason Wang	3df97ba830	tuntap: calculate rps hash only when needed There's no need to calculate rps hash if it was not enabled. So this patch exports rps_needed and checks it before trying to get rps hash. Tests (using pktgen to inject packets to guest) show this can improve pps by about 13% (when rps is disabled). Before: ~1150000 pps After: ~1300000 pps Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> ---- Changes from V1: - Fix build when CONFIG_RPS is not set Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-28 16:38:54 -04:00
David S. Miller	f345c9a572	Merge branch 'tcp-eor' Martin KaFai Lau says: ==================== tcp: Make use of MSG_EOR in tcp_sendmsg v4: ~ Do not set eor bit in do_tcp_sendpages() since there is no way to pass MSG_EOR from the userland now. ~ Avoid rmw by testing MSG_EOR first in tcp_sendmsg(). ~ Move TCP_SKB_CB(skb)->eor test to a new helper tcp_skb_can_collapse_to() (suggested by Soheil). ~ Add some packetdrill tests. v3: ~ Separate EOR marking from the SKBTX_ANY_TSTAMP logic. ~ Move the eor bit test back to the loop in tcp_sendmsg and tcp_sendpage because there could be >1 threads doing sendmsg. ~ Thanks to Eric Dumazet's suggestions on v2. ~ The TCP timestamp bug fixes are separated into other threads. v2: ~ Rework based on the recent work "add TX timestamping via cmsg" by Soheil Hassas Yeganeh <soheil.kdev@gmail.com> ~ This version takes the MSG_EOR bit as a signal of end-of-response-message and leave the selective timestamping job to the cmsg ~ Changes based on the v1 feedback (like avoid unlikely check in a loop and adding tcp_sendpage support) ~ The first 3 patches are bug fixes. The fixes in this series depend on the newly introduced txstamp_ack in net-next. I will make relevant patches against net after getting some feedback. ~ The test results are based on the recently posted net fix: "tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks" One potential use case is to use MSG_EOR with SOF_TIMESTAMPING_TX_ACK to get a more accurate TCP ack timestamping on application protocol with multiple outgoing response messages (e.g. HTTP2). One of our use case is at the webserver. The webserver tracks the HTTP2 response latency by measuring when the webserver sends the first byte to the socket till the TCP ACK of the last byte is received. In the cases where we don't have client side measurement, measuring from the server side is the only option. In the cases we have the client side measurement, the server side data can also be used to justify/cross-check-with the client side data. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-28 16:14:20 -04:00
Martin KaFai Lau	a166140e81	tcp: Handle eor bit when fragmenting a skb When fragmenting a skb, the next_skb should carry the eor from prev_skb. The eor of prev_skb should also be reset. Packetdrill script for testing: ~~~~~~ +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10` +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1` +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> 0.200 < . 1:1(0) ack 1 win 257 0.200 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 0.200 sendto(4, ..., 15330, MSG_EOR, ..., ...) = 15330 0.200 sendto(4, ..., 730, 0, ..., ...) = 730 0.200 > . 1:7301(7300) ack 1 0.200 > . 7301:14601(7300) ack 1 0.300 < . 1:1(0) ack 14601 win 257 0.300 > P. 14601:15331(730) ack 1 0.300 > P. 15331:16061(730) ack 1 0.400 < . 1:1(0) ack 16061 win 257 0.400 close(4) = 0 0.400 > F. 16061:16061(0) ack 1 0.400 < F. 1:1(0) ack 16062 win 257 0.400 > . 16062:16062(0) ack 2 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-28 16:14:19 -04:00
Martin KaFai Lau	a643b5d41c	tcp: Handle eor bit when coalescing skb This patch: 1. Prevent next_skb from coalescing to the prev_skb if TCP_SKB_CB(prev_skb)->eor is set 2. Update the TCP_SKB_CB(prev_skb)->eor if coalescing is allowed Packetdrill script for testing: ~~~~~~ +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10` +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1` +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> 0.200 < . 1:1(0) ack 1 win 257 0.200 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730 0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730 0.200 write(4, ..., 11680) = 11680 0.200 > P. 1:731(730) ack 1 0.200 > P. 731:1461(730) ack 1 0.200 > . 1461:8761(7300) ack 1 0.200 > P. 8761:13141(4380) ack 1 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:13141,nop,nop> 0.300 > P. 1:731(730) ack 1 0.300 > P. 731:1461(730) ack 1 0.400 < . 1:1(0) ack 13141 win 257 0.400 close(4) = 0 0.400 > F. 13141:13141(0) ack 1 0.500 < F. 1:1(0) ack 13142 win 257 0.500 > . 13142:13142(0) ack 2 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-28 16:14:19 -04:00
Martin KaFai Lau	c134ecb878	tcp: Make use of MSG_EOR in tcp_sendmsg This patch adds an eor bit to the TCP_SKB_CB. When MSG_EOR is passed to tcp_sendmsg, the eor bit will be set at the skb containing the last byte of the userland's msg. The eor bit will prevent data from appending to that skb in the future. The change in do_tcp_sendpages is to honor the eor set during the previous tcp_sendmsg(MSG_EOR) call. This patch handles the tcp_sendmsg case. The followup patches will handle other skb coalescing and fragment cases. One potential use case is to use MSG_EOR with SOF_TIMESTAMPING_TX_ACK to get a more accurate TCP ack timestamping on application protocol with multiple outgoing response messages (e.g. HTTP2). Packetdrill script for testing: ~~~~~~ +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10` +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1` +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> 0.200 < . 1:1(0) ack 1 win 257 0.200 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 0.200 write(4, ..., 14600) = 14600 0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730 0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730 0.200 > . 1:7301(7300) ack 1 0.200 > P. 7301:14601(7300) ack 1 0.300 < . 1:1(0) ack 14601 win 257 0.300 > P. 14601:15331(730) ack 1 0.300 > P. 15331:16061(730) ack 1 0.400 < . 1:1(0) ack 16061 win 257 0.400 close(4) = 0 0.400 > F. 16061:16061(0) ack 1 0.400 < F. 1:1(0) ack 16062 win 257 0.400 > . 16062:16062(0) ack 2 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Yuchung Cheng <ycheng@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-28 16:14:18 -04:00
David S. Miller	2a9e8438a2	Merge branch 'tcp-redundant-checks' Soheil Hassas Yeganeh says: ==================== tcp: simplify ack tx timestamps v2: - Fully remove SKBTX_ACK_TSTAMP, as suggested by Willem de Bruijn. This patch series aims at removing redundant checks and fields for ack timestamps for TCP. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-28 16:06:11 -04:00
Soheil Hassas Yeganeh	0a2cf20c3f	tcp: remove SKBTX_ACK_TSTAMP since it is redundant The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when the timestamp of the TCP acknowledgement should be reported on error queue. Since accessing skb_shinfo is likely to incur a cache-line miss at the time of receiving the ack, the txstamp_ack bit was added in tcp_skb_cb, which is set iff the SKBTX_ACK_TSTAMP flag is set for an skb. This makes SKBTX_ACK_TSTAMP flag redundant. Remove the SKBTX_ACK_TSTAMP and instead use the txstamp_ack bit everywhere. Note that this frees one bit in shinfo->tx_flags. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Suggested-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-28 16:06:10 -04:00
Soheil Hassas Yeganeh	863c1fd981	tcp: remove an unnecessary check in tcp_tx_timestamp Remove the redundant check for sk->sk_tsflags in tcp_tx_timestamp. tcp_tx_timestamp() receives the tsflags as a parameter. As a result the "sk->sk_tsflags || tsflags" is redundant, since tsflags already includes sk->sk_tsflags plus overrides from control messages. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-28 16:06:10 -04:00
Eric Dumazet	ba7863f4d3	net: snmp: fix 64bit stats on 32bit arches I accidentally replaced BH disabling by preemption disabling in SNMP_ADD_STATS64() and SNMP_UPD_PO_STATS64() on 32bit builds. For 64bit stats on 32bit arch, we really need to disable BH, since the "struct u64_stats_sync syncp" might be manipulated both from process and BH contexts. Fixes: `6aef70a851` ("net: snmp: kill various STATS_USER() helpers") Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-28 11:49:45 -04:00
David S. Miller	8be2748a40	Merge branch 'socket-space-optimizations' Eric Dumazet says: ==================== net: avoid some atomic ops when FASYNC is not used We can avoid some atomic operations on sockets not using FASYNC ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-04-27 23:08:41 -04:00
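
The three MSG_EOR commits listed above (c134ecb878, a643b5d41c, a166140e81) describe how the eor bit behaves when data is appended, when skbs are coalesced, and when an skb is fragmented. The following is a minimal userspace sketch of those rules, not the kernel code: `struct toy_skb` and the `toy_*` helpers are hypothetical names invented for illustration, and the choice to let the merged segment inherit the eor of the segment it absorbs is an assumption about what "update the eor" means in the coalescing commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb: a payload length plus the eor bit
 * (modelling TCP_SKB_CB(skb)->eor). Not a kernel structure. */
struct toy_skb {
    size_t len;
    bool eor;
};

/* Append rule: new data may be added to the tail segment only while
 * its eor bit is clear. */
static bool toy_can_append(const struct toy_skb *tail)
{
    assert(tail != NULL);
    return !tail->eor;
}

/* Coalescing rule: refuse to merge next into prev when prev carries eor.
 * Otherwise merge, and (assumption) let prev inherit next's eor so the
 * message boundary stays at the end of the merged data. */
static bool toy_try_coalesce(struct toy_skb *prev, struct toy_skb *next)
{
    assert(prev != NULL && next != NULL);
    if (prev->eor)
        return false;
    prev->len += next->len;
    prev->eor = next->eor;
    next->len = 0;
    next->eor = false;
    return true;
}

/* Fragmenting rule: split prev at offset. The new tail segment (next)
 * carries the eor that prev had, and prev's eor is reset. */
static bool toy_fragment(struct toy_skb *prev, struct toy_skb *next,
                         size_t offset)
{
    assert(prev != NULL && next != NULL);
    if (offset == 0 || offset >= prev->len)
        return false; /* nothing to split off */
    next->len = prev->len - offset;
    next->eor = prev->eor;
    prev->len = offset;
    prev->eor = false;
    return true;
}
```

In this model, two 730-byte segments that each end a message never merge, which matches the separate `1:731` and `731:1461` segments in the packetdrill script of the coalescing commit.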


590634 Commits