Pavel Emelyanov says:
====================
tun: Some bits required for tun's checkpoint-restore (v2)
After taking a closer look at tun checkpoint-restore I've found several
issues with the tun API that make it impossible to dump and restore
the state of a tun device and its attached tun-files.
The proposed API changes are all about extending the existing ioctl-based
stuff. Patches fit today's net-next.
This v2 has David's comments about patch #1 fixed. All the rest is the same.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The only thing we may have from the tun device is the fprog, which contains
the number of filter elements and a pointer to (user-space) memory
where the elements are. The program itself may not be available if the
device is persistent and detached.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's a small problem with sk-filters on tun devices. Consider
an application doing this sequence of steps:
fd = open("/dev/net/tun");
ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" });
ioctl(fd, TUNATTACHFILTER, &my_filter);
ioctl(fd, TUNSETPERSIST, 1);
close(fd);
At that point tun0 will remain in the system and will remember
that there should be a socket filter at address '&my_filter'.
If after that we do
fd = open("/dev/net/tun");
ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" });
we will most likely get the -EFAULT error, since tun_attach() would
try to connect the filter back. But (!) if we provide a filter at
address &my_filter, then tun0 will be created and the "new" filter
would be attached, but the application may not know about that.
This may create certain problems for anyone using tun-s, but it's
a critical problem for c/r -- if we meet a persistent tun device
with a filter in mind, we will not be able to attach to it to dump
its state (flags, owner, address, vnethdr size, etc.).
The proposal is to allow attaching to a tun device (with TUNSETIFF)
w/o attaching the filter to the tun-file's socket. After this
attach the app may e.g. clean the device by dropping the filter if it
doesn't want to have one, or (in case of c/r) get information
about the device with tun ioctls.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
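A minimal userspace sketch of the attach described above. It assumes the
skip-filter flag is exposed as IFF_NOFILTER in <linux/if_tun.h>; the text
above does not name the flag, so treat both the name and the fallback value
as assumptions. The helper tun_attach_nofilter() and its clone_dev parameter
are hypothetical. It returns a file descriptor or a negative errno.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>

#ifndef IFF_NOFILTER
/* Assumed value, only used when the installed headers lack the flag. */
#define IFF_NOFILTER 0x1000
#endif

/* Hypothetical helper: attach to an existing persistent device without
 * re-attaching the socket filter the device remembers. */
int tun_attach_nofilter(const char *clone_dev, const char *name)
{
	struct ifreq ifr;
	int fd, err;

	assert(clone_dev != NULL && name != NULL);

	fd = open(clone_dev, O_RDWR);
	if (fd < 0)
		return -errno;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
	/* IFF_TUN is an assumption; a tap device would need IFF_TAP. */
	ifr.ifr_flags = IFF_TUN | IFF_NO_PI | IFF_NOFILTER;

	if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
		err = -errno;
		close(fd);
		return err;
	}
	return fd;
}
```

A c/r tool would call tun_attach_nofilter("/dev/net/tun", "tun0") and then
query the device with the other tun ioctls before closing the descriptor.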
Multiqueue tun devices allow a file to attach to and detach from its
queue while keeping the interface itself set on the file.
Knowing whether the queue is attached is critical for the checkpoint
part of the criu project.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tun devices cannot be created with the ifindex the user wants, but this
is required by the checkpoint-restore project.
A long time ago such an ability was implemented for the rtnl_ops-based
interface for creating links (9c7dafbf net: Allow to create links
with given ifindex), but the only API for creating and managing
tuntap devices is ioctl-based and is evolving with adding new ones
(cde8b15f tuntap: add ioctl to attach or detach a file form tuntap
device).
Following that trend, here is a new ioctl that sets the ifindex
for a device that _will_ be created by the TUNSETIFF ioctl.
So those who want a tuntap device with ifindex N should open
the tun device, call ioctl(fd, TUNSETIFINDEX, &N), then call TUNSETIFF.
If the index N is busy, register_netdev will find this out
and the ioctl will fail with -EBUSY.
If TUNSETIFINDEX is not called, the ifindex is generated as before.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
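The call order described above can be sketched from userspace as follows.
This is an illustration, not the patch itself: the helper name
tun_create_with_ifindex(), the clone_dev parameter and the choice of
IFF_TUN | IFF_NO_PI are assumptions, and the fallback TUNSETIFINDEX
definition is only used when the installed headers predate the ioctl.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>

#ifndef TUNSETIFINDEX
/* Assumed definition for headers that predate this ioctl. */
#define TUNSETIFINDEX _IOW('T', 218, unsigned int)
#endif

/* Hypothetical helper: returns an fd on success or a negative errno. */
int tun_create_with_ifindex(const char *clone_dev, const char *name,
			    unsigned int ifindex)
{
	struct ifreq ifr;
	int fd, err;

	assert(clone_dev != NULL && name != NULL);

	fd = open(clone_dev, O_RDWR);
	if (fd < 0)
		return -errno;

	/* Must come first: the index applies to the device that the
	 * following TUNSETIFF creates. */
	if (ioctl(fd, TUNSETIFINDEX, &ifindex) < 0)
		goto fail;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
	ifr.ifr_flags = IFF_TUN | IFF_NO_PI;

	/* Expected to fail with EBUSY when the index is already taken. */
	if (ioctl(fd, TUNSETIFF, &ifr) < 0)
		goto fail;

	return fd;

fail:
	err = -errno;
	close(fd);
	return err;
}
```

On restore, a tool would pass the ifindex it recorded at dump time and treat
-EBUSY as a conflict with an interface that already exists.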
The workarounds that currently use EFX_WORKAROUND_ALWAYS are in
Falcon-specific or Falcon-arch-specific code, so get rid of the
conditions altogether. Add/move comments as appropriate.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
EF10 functions don't have a fixed BAR size, and the minimum is not
large enough for all the queues we might want to allocate. We have to
find out the BAR size at run-time, and therefore phys_addr_channels
and mem_map_size cannot be defined per-NIC-type.
Change efx_nic_type::mem_map_size to a function pointer which is
called to find the wanted memory map size (before probe).
Replace efx_nic_type::phys_addr_channels with efx_nic::max_channels,
to be initialised by the probe function.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
When we poll for MCDI request completion, we don't hold the interface
lock while setting the response fields in struct efx_mcdi_iface.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
MCDI v2 adds a second header dword with wider command and length
fields. It also defines extra error codes.
Change the fallback error number for unknown MCDI error codes from EIO
to EPROTO. EIO is treated as indicating the MCDI transport has failed
and we need to reset the function, which is rather drastic.
v2 error codes and lengths don't fit into completion events, so for a
v2-capable transport, always read the response header rather than
using the event fields.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
EF10 controllers do not have shared memory for communication with the
MC; instead it reads requests and writes responses in host memory,
which allows for longer messages. It is also responsible for all
datapath control operations and hardware resource allocation, which
requires a large number of new commands and adds more possible error
cases. MCDI v2 extends the message header to support this.
Update the MCDI protocol definition header to include v2 lengths,
errors and messages, and a few definitions specific to the
SFC9100 family (codenames Farmingdale and Huntington) which is
the first generation of EF10.
Some messages have been extended, so adjust the code accordingly:
- The request for MC_CMD_DRV_ATTACH now includes a datapath firmware
ID. This is ignored by Siena but we should fill it in anyway,
initially always specifying low-latency datapath.
- The response for MC_CMD_GET_LOOPBACK_MODES now includes a 40G
field. Accept shorter responses that don't include it.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
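Accepting a shorter response can be sketched as a length check before
reading the optional field. The offsets and the helper name below are
hypothetical and do not come from the MCDI protocol header; the sketch only
shows the "missing field reads as zero" idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: older responses end at byte 32; newer ones
 * append an 8-byte little-endian 40G mask at that offset. */
#define EX_LOOPBACK_40G_OFST 32
#define EX_LOOPBACK_40G_LEN  8

/* Return the 40G loopback mask, or 0 if the response is too short
 * to contain it. */
uint64_t ex_loopback_40g_mask(const uint8_t *resp, size_t resp_len)
{
	uint64_t mask = 0;
	size_t i;

	assert(resp != NULL);

	/* Short response from older firmware: no 40G modes reported. */
	if (resp_len < EX_LOOPBACK_40G_OFST + EX_LOOPBACK_40G_LEN)
		return 0;

	/* Assemble the little-endian value byte by byte. */
	for (i = 0; i < EX_LOOPBACK_40G_LEN; i++)
		mask |= (uint64_t)resp[EX_LOOPBACK_40G_OFST + i] << (8 * i);
	return mask;
}
```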
Currently we only translate error codes in efx_mcdi_poll(), but we
also need to do so in efx_mcdi_ev_cpl().
The reason we didn't notice before is that the MC firmware error codes
are mostly taken from Unix/Linux and no translation is necessary on
most architectures. Make sure we notice any future failure by
changing the sign of resprc (matching the kernel convention) and BUG
if it's ever positive at command completion.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
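A rough model of the translation and the sign check, under stated
assumptions: the EX_MC_ERR_* values and both function names are made up for
illustration, and assert() stands in for the kernel's BUG. Unknown codes map
to -EPROTO rather than -EIO, so they are not mistaken for a transport
failure.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical firmware error numbers, for illustration only. */
enum ex_mc_err {
	EX_MC_ERR_ENOENT = 2,
	EX_MC_ERR_EBUSY = 16,
	EX_MC_ERR_EINVAL = 22,
};

/* Translate a firmware error number into a negative errno. */
int ex_mcdi_errno(unsigned int mc_err)
{
	switch (mc_err) {
	case 0:
		return 0;
	case EX_MC_ERR_ENOENT:
		return -ENOENT;
	case EX_MC_ERR_EBUSY:
		return -EBUSY;
	case EX_MC_ERR_EINVAL:
		return -EINVAL;
	default:
		/* Unknown code: not a transport failure, so avoid -EIO. */
		return -EPROTO;
	}
}

/* Meant to be shared by the polled and the event-driven completion
 * paths, so that neither can skip the translation. */
int ex_mcdi_complete(unsigned int mc_err)
{
	int resprc = ex_mcdi_errno(mc_err);

	assert(resprc <= 0);	/* stands in for the kernel's BUG */
	return resprc;
}
```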
The results of skb_frag_dma_map() and dma_map_single() weren't checked.
Add a check and proper handling in case of failure.
Move the mapping to the beginning of mlx4_en_xmit(), before updating
the ring data structure, to make error handling easier.
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the hardware gets into an error state, the driver must notify the
user about it. When a QP is in the error state, no traffic will be
transmitted from the attached tx_ring.
The driver should know how to recover from this unexpected state. I will
send the recovery flow later on, but adding the print shouldn't be delayed.
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a bug when FC and PFC are enabled/disabled at the same time.
According to the ConnectX-3 Programmer Manual these two features are
mutually exclusive. So make sure to turn off global FC when enabling PFC,
and vice versa. Otherwise performance suffers.
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add efx_nic_type operations for the many efx_nic functions that need
to be implemented differently on EF10. For now, change most of the
existing efx_nic_*() functions into inline wrappers. As a later step,
we may be able to improve branch prediction for operations used on the
fast path by copying the pointers into each queue/channel structure.
Move the Falcon/Siena implementations to new file farch.c and rename
the functions and static data to use a prefix of 'efx_farch_'.
Move efx_may_push_tx_desc() to nic.h, as the EF10 TX code will also
use it.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Each function driver instance uses the MAC address of the
lowest function belonging to that physical port as a unique
port identifier. This port identifier is read and cached in
the driver during probe and provided to user space through
ndo_get_phys_port_id().
Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
o Enable diagnostic test via ethtool and QConvergeConsole
application when Multiple Tx queues are enabled on 82xx
series adapters.
Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
o Using the ethtool {set|get}_channel option, the user can change the
number of Tx queues for 82xx series adapters.
o Update the ethtool -S <ethX> option to display stats from each Tx queue.
Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
o 82xx firmware allows support for multiple Tx queues. This
patch will enable multi Tx queue support for 82xx series
adapter. Max number of Tx queues supported will be 8.
Signed-off-by: Himanshu Madhani <himanshu.madhani@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently efx_stop_datapath() will try to flush our DMA queues (if DMA
is enabled), then finalise software and hardware state for each queue.
However, for EF10 we must ask the MC to finalise each queue, which
implicitly starts flushing it, and then wait for the flush events.
We therefore need to delegate more of this to the NIC type.
Combine all the hardware operations into a new NIC-type operation
efx_nic_type::fini_dmaq, and call this before tearing down the
software state and buffers for all the DMA queues.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
efx_unregister_netdev() should not call efx_release_tx_buffers()
directly, as it is already done when closing the device:
efx_net_stop() -> efx_stop_all() -> efx_stop_datapath() ->
efx_fini_tx_queue() -> efx_release_tx_buffers().
(This was presumably a workaround for a race between efx_stop_all()
and the data path that has since been properly fixed.)
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
rx_queue::enabled guards refill, so rename it to reflect that. Clear
it at the start of the queue teardown process rather than waiting for
the RX queue to be flushed.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
We unconditionally acknowledge legacy interrupts just before disabling
them. This workaround is needed on Falcon A1 but probably not on
later chips where the legacy interrupt mechanism is different. It was
also originally done after the IRQ handler was removed, not before.
Restore the original behaviour for Falcon A1 only by doing this
acknowledgement in the efx_nic_type::fini operation.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
There are many problems with the current efx_stop_interrupts() and
efx_start_interrupts():
1. On Siena, it is unsafe to disable the master IRQ enable bit
(DRV_INT_EN_KER) while any IRQ sources are enabled.
2. On EF10 there is no master IRQ enable bit, so we cannot expect to
defer IRQs without tearing down event queues. (Though I don't think
we will need to keep any event queues around while the device is down,
as we do for VFDI on Siena.)
3. synchronize_irq() only waits for a running IRQ handler to finish,
not for any propagation through IRQ controllers. Therefore an IRQ may
still be received and handled after efx_stop_interrupts() returns.
IRQ handlers can then race with channel reallocation.
To fix this:
a. Introduce a software IRQ enable flag. So long as this is clear,
IRQ handlers will only acknowledge IRQs and not touch the channel
structures.
b. Define a new struct efx_msi_context as the context for MSIs. This
is never reallocated and is sufficient to find the software enable
flag and the channel structure. It also includes the channel/IRQ
name, which was previously separated out as it must also not be
reallocated.
c. Split efx_{start,stop}_interrupts() into
efx_{,soft_}{enable,disable}_interrupts(). The 'soft' functions
don't touch the hardware master enable flag (if it exists) and don't
reinitialise or tear down channels with the keep_eventq flag set.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
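Points (a) and (b) above can be modelled in plain C as below. This is a
userspace sketch, not the driver code: every ex_* name is hypothetical, a
C11 atomic stands in for the kernel's flag and barriers, and a counter
stands in for the hardware acknowledgement.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct ex_channel {
	unsigned int events_handled;
};

struct ex_nic {
	atomic_bool irq_soft_enabled;	/* the software IRQ enable flag */
	unsigned int irqs_acked;
	struct ex_channel channel[1];	/* may be reallocated in a driver */
};

/* Never reallocated, so it is safe to register as the IRQ context. */
struct ex_msi_context {
	struct ex_nic *nic;
	unsigned int index;
	char name[32];
};

void ex_msi_interrupt(struct ex_msi_context *ctx)
{
	struct ex_nic *nic;

	assert(ctx != NULL && ctx->nic != NULL);
	nic = ctx->nic;

	nic->irqs_acked++;		/* always acknowledge the IRQ */
	if (!atomic_load(&nic->irq_soft_enabled))
		return;			/* flag clear: leave channels alone */
	nic->channel[ctx->index].events_handled++;
}

/* Fire n IRQs at a one-channel NIC and report how many of them
 * reached the channel structure. */
unsigned int ex_fire_irqs(bool soft_enabled, unsigned int n)
{
	struct ex_nic nic = { .irqs_acked = 0 };
	struct ex_msi_context ctx = { .nic = &nic, .index = 0, .name = "ex-0" };

	atomic_init(&nic.irq_soft_enabled, soft_enabled);
	while (n--)
		ex_msi_interrupt(&ctx);
	return nic.channel[0].events_handled;
}
```

With the flag clear, a late IRQ is still acknowledged but cannot race with
channel reallocation, because the handler never dereferences the channel.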
efx_process_channel_now() is unneeded since self-tests can rely on
normal NAPI polling. Remove it and all calls to it.
efx_channel::work_pending and efx_channel_processed() are also
unneeded (the latter being the same as efx_nic_eventq_read_ack()).
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
The EF10 architecture has a very different register layout from
previous controllers, so we'll use separate files for the two sets of
register definitions. Use 'farch' as an abbreviation for
Falcon-architecture.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
On EF10, the firmware is in charge of allocating buffer table entries.
Change struct efx_special_buffer to use a struct efx_buffer member,
so that it can be used with efx_nic_{alloc,free}_buffer() in that
case.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Most call sites for efx_nic_alloc_buffer() are part of the probe or
reconfiguration paths and can allocate with GFP_KERNEL. A few others
should use GFP_NOIO (I think). Only one is in atomic context and
must use the current GFP_ATOMIC.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Move the lowest layer (transport) of the current MCDI code to
per-NIC-type operations.
Introduce a new structure and efx_nic member for MCDI-specific data.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
This should probably be done during MCDI initialisation for any NIC.
Change efx_mcdi_init() to return an error code.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Collect together MCDI port functions from mcdi.c, mcdi_mac.c,
mcdi_phy.c and siena.c. Rename the 'siena' functions accordingly.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
We currently require that MCDI request and response lengths are
multiples of 4 bytes, because we will copy dwords in and out of shared
memory and we want to be sure we won't read or write out of bounds.
But all we really need to know is that there is sufficient padding for
that. Also, we should ensure that buffers are dword-aligned, as on
some architectures misaligned access will result in data corruption or
a crash.
Change the buffer type to array-of-efx_dword_t and remove the
requirement that the lengths are multiples of 4.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
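One way to get both the padding and the alignment is to declare the buffer
as an array of dwords whose element count is rounded up. The sketch below
uses hypothetical EX_* names and a stand-in ex_dword_t; it is modelled on
the description above, not copied from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for efx_dword_t: one 32-bit unit. */
typedef union {
	uint32_t u32[1];
} ex_dword_t;

#define EX_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Declare a dword-aligned buffer with room for len bytes plus padding. */
#define EX_MCDI_DECLARE_BUF(name, len) \
	ex_dword_t name[EX_DIV_ROUND_UP(len, 4)]

static_assert(_Alignof(ex_dword_t) >= 4, "buffers must be dword-aligned");

/* Bytes actually reserved for a message of len bytes. */
size_t ex_padded_len(size_t len)
{
	return 4 * EX_DIV_ROUND_UP(len, 4);
}

/* A 5-byte message (hypothetical length) still gets two whole dwords,
 * so copying dwords in and out stays within bounds. */
size_t ex_demo_buf_size(void)
{
	EX_MCDI_DECLARE_BUF(inbuf, 5);

	return sizeof(inbuf);
}
```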
A few functions are using heap buffers; change them to use stack
buffers as we really don't need to resort to the heap for a 252
byte buffer in process context.
MC_CMD_MEMCPY is quite weird in that it can use inline data placed in
the request buffer after the array of records. Thus there are two
variable-length arrays and we can't use the normal accessors for
the second. So we have to use _MCDI_PTR() in efx_sriov_memcpy().
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
We need to access arrays of 16-bit words and 32-bit dwords in MCDI
buffers based on the MCDI protocol definitions.
We should also be able to read and write fields within structures,
without specifying an array index each time. So add MCDI_FIELD()
and make MCDI_ARRAY_FIELD() use it. Also add MCDI_SET_FIELD().
Split MCDI_ARRAY_PTR() into MCDI_ARRAY_STRUCT_PTR() and
_MCDI_ARRAY_PTR(), which are currently identical but will diverge in
later changes.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Add _MCDI_DWORD() which yields an lvalue for the given dword field
and change MCDI_DWORD(), MCDI_SET_DWORD() and MCDI_QWORD() to use it.
Fold the rather trivial MCDI_PTR2() into MCDI_PTR() and _MCDI_DWORD().
Remove MCDI_SET_DWORD2() and MCDI_QWORD2(). MCDI_DWORD2() should also
go, but it still has one user which we'll get rid of later.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
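The lvalue idea can be illustrated with simplified macros. All EX_* names
and offsets here are hypothetical, and the sketch stores host-endian
uint32_t values; it makes no attempt to handle the little-endian dwords of
the real buffers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field offsets within a message, in bytes. */
#define EX_FIELD_LO_OFST 8
#define EX_FIELD_HI_OFST 12

/* Yields an lvalue for the dword at byte offset ofst. */
#define EX_MCDI_DWORD_LVALUE(buf, ofst) ((buf)[(ofst) >> 2])

/* Getter, setter and qword reader all built on the single lvalue macro. */
#define EX_MCDI_DWORD(buf, ofst) \
	((uint32_t)EX_MCDI_DWORD_LVALUE(buf, ofst))
#define EX_MCDI_SET_DWORD(buf, ofst, val) \
	(EX_MCDI_DWORD_LVALUE(buf, ofst) = (val))
#define EX_MCDI_QWORD(buf, ofst) \
	((uint64_t)EX_MCDI_DWORD(buf, ofst) | \
	 (uint64_t)EX_MCDI_DWORD(buf, (ofst) + 4) << 32)

/* Write two dwords, then read them back as one 64-bit value. */
uint64_t ex_qword_roundtrip(uint32_t lo, uint32_t hi)
{
	uint32_t buf[4] = { 0 };

	/* Field offsets must be dword-aligned for the shift to be exact. */
	assert((EX_FIELD_LO_OFST & 3) == 0 && (EX_FIELD_HI_OFST & 3) == 0);

	EX_MCDI_SET_DWORD(buf, EX_FIELD_LO_OFST, lo);
	EX_MCDI_SET_DWORD(buf, EX_FIELD_HI_OFST, hi);
	return EX_MCDI_QWORD(buf, EX_FIELD_LO_OFST);
}
```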
MCDI_DECLARE_BUF declares a variable as an MCDI buffer of the
requested length, adding any necessary padding.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>