Use ib_alloc_mr with specific parameters.
Change the existing callers.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This was added in a thought of uniting all mr allocation
and deallocation routines but the fact is we have a single
deallocation routine already, ib_dereg_mr.
And, move mlx5_ib_destroy_mr specific logic into mlx5_ib_dereg_mr
(includes only signature stuff for now).
And, fixup the only callers (iser/isert) accordingly.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When no matching listening ID is found for a given request, the net_dev
that was used to find the request isn't released.
Fixes: 0b3ca768fc ("IB/cma: Use found net_dev for passive connections")
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Now that there are no ib_cm clients using the compare_data feature for
matching IB CM requests' private data, remove the compare_data parameter of
ib_cm_listen and remove the code implementing the feature.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use ib_cm_insert_listen to create listening IB CM IDs or share existing
ones if needed. When given a request on a specific CM ID, the code now
matches the request to the RDMA CM ID based on the request parameters, so
it no longer needs to rely on the ib_cm's private data matching
capabilities.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When receiving a new connection in cma_req_handler, we actually already
know the net_dev that is used for the connection's creation. Instead of
calling cma_translate_addr to resolve the new connection id's source
address, just use the net_dev that was found.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Pass incoming request parameters through the relevant IPv4/IPv6 routing
tables and make sure the network stack is configured to handle such
requests.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Instead of relying on a the ib_cm module to check an incoming CM request's
private data header, add these checks to the RDMA CM module. This allows a
following patch to to clean up the ib_cm interface and remove the code that
looks into the private headers. It will also allow supporting namespaces in
RDMA CM by making these checks namespace aware later on.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The rdma_cm module will later use the P_Key from the BTH to de-mux
requests.
See discussion at:
http://www.spinics.net/lists/netdev/msg336067.html
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Liran Liss <liranl@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add helper functions to access the IDRs by port-space and port number.
Pass around the port-space enum in cma.c instead of using pointers to
port-space IDRs.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When receiving a connection request, rdma_cm needs to associate the request
with a network device, in order to disambiguate requests. To do this, it
needs to know the request's destination IP. For this the module needs to
allow getting this information from the private data in the request packet,
instead of relying on the information already being in the listening RDMA
CM ID.
When creating a new incoming connection ID, the code in
cma_save_ip{4,6}_info can no longer rely on the listener's private data to
find the port number, so it reads it from the requested service ID.
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Enabling network namespaces for RDMA CM will allow processes on different
namespaces to listen on the same port. In order to leave namespace support
out of the CM layer, this requires that multiple RDMA CM IDs will be able
to share a single CM ID.
This patch adds infrastructure to retrieve an existing listening ib_cm_id,
based on its device and service ID, or create a new one if one does not
already exist. It also adds a reference count for such instances
(cm_id_private.listen_sharecount), and prevents cm_destroy_id from
destroying a CM if it is still shared. See the relevant discussion [1].
[1] Re: [PATCH v3 for-next 05/13] IB/cm: Reference count ib_cm_ids
http://www.spinics.net/lists/netdev/msg328860.html
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Expose the service ID on an incoming CM or SIDR request to the event
handler. This will allow the RDMA CM module to de-multiplex connection
requests based on the information encoded in the service ID.
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Implement the get_net_device_by_port_pkey_ip callback that returns network
device to ib_core according to connection parameters. Check the ipoib
device and iterate over all child devices to look for a match.
For each IPoIB device we iterate through all upper devices when searching
for a matching IP, in order to support bonding.
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In the case of IPoIB, and maybe in other cases, the network device is
managed by an upper-layer protocol (ULP). In order to expose this
network device to other users of the IB device, let ULPs implement
a callback that returns network device according to connection parameters.
The IB device and port, together with the P_Key and the GID should
be enough to uniquely identify the ULP net device. However, in current
kernels there can be multiple IPoIB interfaces created with the same GID.
Furthermore, such configuration may be desireable to support ipvlan-like
configurations for RDMA CM with IPoIB. To resolve the device in these
cases the code will also take the IP address as an additional input.
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
An ib_client callback that is called with the lists_rwsem locked only for
read is protected from changes to the IB client lists, but not from
ib_unregister_device() freeing its client data. This is because
ib_unregister_device() will remove the device from the device list with
lists_rwsem locked for write, but perform the rest of the cleanup,
including the call to remove() without that lock.
Mark client data that is undergoing de-registration with a new going_down
flag in the client data context. Lock the client data list with lists_rwsem
for write in addition to using the spinlock, so that functions calling the
callback would be able to lock only lists_rwsem for read and let callbacks
sleep.
Since ib_unregister_client() now marks the client data context, no need for
remove() to search the context again, so pass the client data directly to
remove() callbacks.
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently the RDMA subsystem's device list and client list are protected by
a single mutex. This prevents adding user-facing APIs that iterate these
lists, since using them may cause a deadlock. The patch attempts to solve
this problem by adding a read-write semaphore to protect the lists. Readers
now don't need the mutex, and are safe just by read-locking the semaphore.
The ib_register_device, ib_register_client, ib_unregister_device, and
ib_unregister_client functions are modified to lock the semaphore for write
during their respective list modification. Also, in order to make sure
client callbacks are called only between add() and remove() calls, the code
is changed to only add items to the lists after the add() calls and remove
from the lists before the remove() calls.
This patch attempts to solve a similar need [1] that was seen in the RoCE
v2 patch series.
[1] http://www.spinics.net/lists/linux-rdma/msg24733.html
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Matan Barak <matanb@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This functionality already exists via the max_sge_rd
device capability.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Applications must not assume that max_sge and max_sge_rd are the same,
Hence expose max_sge_rd correctly as well.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Applications must not assume that max_sge and max_sge_rd are the same,
Hence expose max_sge_rd correctly as well.
Reported-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Running checkpatch.pl on mad.c produces several
"ERROR: code indent should use tabs where possible" messages.
This patch fixes these.
Signed-off-by: Jeff Becker <Jeffrey.C.Becker@nasa.gov>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds the value of the CNP opcode to the existing list of enumerated
opcodes in ib_pack.h
Add common OPA header definitions for driver
build:
- opa_port_info.h
- opa_smi.h
- hfi1_user.h
Additionally, ib_mad.h, has additional definitions
that are common to ib_drivers including:
- trap support
- cca support
The qib driver has the duplication removed in favor
those in ib_mad.h
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: John, Jubin <jubin.john@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The HW hasn't been sold since 2005, and the SW has definite bit rot.
Its time to remove it. So move it to staging for a few releases and
then remove it after that.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
It is now time for the ipath driver to begin to be phased out of the kernel.
This patch moves the ipath driver from the Infiniband sub tree to the staging
area where it will remain until the code is removed from the kernel in a few
releases.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Create the rdma directory in the staging area for use as we deprecate
some older drivers and as we bring in some new drivers that are in
need of work. Update the MAINTAINERS file so that updates to these
files go to linux-rdma@vger.kernel.org. Expected lifespan of this
directory is three releases for any deprecated drivers moved here
and an unknown, but theoretically bounded amount of time for the new
drivers as a new core RDMA transfer library needs to be written and
the drivers modified to use it in order for them to move out of this
directory.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add support for ipv6 address handling clip api provided by lld
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Resolving a link-local IPv6 address with an unspecified source address
was broken by commit 5462eddd7a, which prevented the IPv6 stack from
learning the scope id of the link-local IPv6 address, causing random
failures as the IP stack chose a random link to resolve the address on.
This commit 5462eddd7a made us bail out of cma_check_linklocal early if
the address passed in was not an IPv6 link-local address. On the address
resolution path, the address passed in is the source address; if the
source address is the unspecified address, which is not link-local, we
will bail out early.
This is mostly correct, but if the destination address is a link-local
address, then we will be following a link-local route, and we'll need to
tell the IPv6 stack what the scope id of the destination address is.
This used to be done by last line of cma_check_linklocal, which is
skipped when bailing out early:
dev_addr->bound_dev_if = sin6->sin6_scope_id;
(In cma_bind_addr, the sin6_scope_id of the source address is set to the
sin6_scope_id of the destination address, so this is correct)
This line is required in turn for the following line, L279 of
addr6_resolve, to actually inform the IPv6 stack of the scope id:
fl6.flowi6_oif = addr->bound_dev_if;
Since we can only know we are in this failure case when we have access
to both the source IPv6 address and destination IPv6 address, we have to
deal with this further up the stack. So detect this failure case in
cma_bind_addr, and set bound_dev_if to the destination address scope id
to correct it.
Signed-off-by: Spencer Baugh <sbaugh@catern.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Something like this:
CPU A CPU B
Acked-by: Sean Hefty <sean.hefty@intel.com>
======================== ================================
ucma_destroy_id()
wait_for_completion()
.. anything
ucma_put_ctx()
complete()
.. continues ...
ucma_leave_multicast()
mutex_lock(mut)
atomic_inc(ctx->ref)
mutex_unlock(mut)
ucma_free_ctx()
ucma_cleanup_multicast()
mutex_lock(mut)
kfree(mc)
rdma_leave_multicast(mc->ctx->cm_id,..
Fix it by latching the ref at 0. Once it goes to 0 mc and ctx cannot
leave the mutex(mut) protection.
The other atomic_inc in ucma_get_ctx is OK because mutex(mut) protects
it from racing with ucma_destroy_id.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This enables ORD/IRD negotiation and its about time to enable it by
default
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently the sg tablesize, which dictates fast register page list
depth to use, does not take into account the limits of the rdma device.
So adjust it once we discover the device fastreg max depth limit. Also
adjust the max_sectors based on the resulting sg tablesize.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The only place that assigns mr inside the loop already does a break.
So "if (mr)" will never be true here since the function initializes mr
to NULL at the top. We can just drop the extra if and break here.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Acked-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The lkey table is allocated with with a get_user_pages() with an
order based on a number of index bits from a module parameter.
The underlying kernel code cannot allocate that many contiguous pages.
There is no reason the underlying memory needs to be physically
contiguous.
This patch:
- switches the allocation/deallocation to vmalloc/vfree
- caps the number of bits to 23 to insure at least 1 generation bit
o this matches the module parameter description
Cc: stable@vger.kernel.org
Reviewed-by: Vinit Agnihotri <vinit.abhay.agnihotri@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Should be all the page sizes that are supported by the
device.
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The mlx5 driver exposes device capability IB_DEVICE_LOCAL_DMA_LKEY
but does not set the the device local_dma_lkey. This breaks
rpcrdma drivers.
Query and set this lkey when creating the device resources.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
A couple of major (hang and deadlock) fixes with fortunately fairly rare
triggering conditions. The PM oops is only really triggered by people using
enclosure services (rare) and the fnic driver is mostly used in enterprise
environments.
Signed-off-by: James Bottomley <JBottomley@Odin.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJV2e34AAoJEDeqqVYsXL0MywIH/0ZzZzofgUammzkjalMxoW1b
rojSyB1bpuADc8eXpqsw1x6coNxKK85e9aAmplXqdykgazw44lzkH43Vez7gbwGN
JG5+utu2hQUMJYF9bQ3NLPu5JgxgP0w6aqY1L1ZndKFrRmEnM53UcojNT3M2NIN3
cgA5ICDD0RSQy24KDSZaN+y3SvopEcY5eXUcLfshrwXI3yAIH4G39z8hQHCFGHZB
BkYq9qjI5T4P7PRE5trRYu7B9rO8IJpoYdPnmI3i49jIyJlpFXP3FlLSmL+gCzyO
FBhdln9sCulCWnTirRf7Gsbq6LfXpihtzzgfCtPoxxwI4PRcIBh0jxu3Od8o9NY=
=jY8K
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"A couple of major (hang and deadlock) fixes with fortunately fairly
rare triggering conditions. The PM oops is only really triggered by
people using enclosure services (rare) and the fnic driver is mostly
used in enterprise environments"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
SCSI: Fix NULL pointer dereference in runtime PM
fnic: Use the local variable instead of I/O flag to acquire io_req_lock in fnic_queuecommand() to avoid deadloack
Pull MIPS bug fixes from Ralf Baechle:
"Two more fixes for 4.2.
One fixes a build issue with the LLVM assembler - LLVM assembler macro
names are case sensitive, GNU as macro names are insensitive; the
other corrects a license string (GPL v2, not GPLv2) such that the
module loader will recognice the license correctly"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
FIRMWARE: bcm47xx_nvram: Fix module license.
MIPS: Fix LLVM build issue.
Pull 9p regression fix from Al Viro:
"Fix for breakage introduced when switching p9_client_{read,write}() to
struct iov_iter * (went into 4.1)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
9p: ensure err is initialized to 0 in p9_client_read/write
Some use of those functions were providing unitialized values to those
functions. Notably, when reading 0 bytes from an empty file on a 9P
filesystem, the return code of read() was not 0.
Tested with this simple program:
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
int main(int argc, const char **argv)
{
assert(argc == 2);
char buffer[256];
int fd = open(argv[1], O_RDONLY|O_NOCTTY);
assert(fd >= 0);
assert(read(fd, buffer, 0) == 0);
return 0;
}
Cc: stable@vger.kernel.org # v4.1
Signed-off-by: Vincent Bernat <vincent@bernat.im>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull ARM fixes from Russell King:
"Another couple of small ARM fixes.
A patch from Masahiro Yamada who noticed that "make -jN all zImage"
would end up generating bad images where N > 1, and a patch from
Nicolas to fix the Marvell CPU user access optimisation code when page
faults are disabled"
* 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm:
ARM: 8418/1: add boot image dependencies to not generate invalid images
ARM: 8414/1: __copy_to_user_memcpy: fix mmap semaphore usage
Pull x86 fixes from Ingo Molnar:
"Various low level fixes: fix more fallout from the FPU rework and the
asm entry code rework, plus an MSI rework fix, and an idle-tracing fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu/math-emu: Fix crash in fork()
x86/fpu/math-emu: Fix math-emu boot crash
x86/idle: Restore trace_cpu_idle to mwait_idle() calls
x86/irq: Build correct vector mapping for multiple MSI interrupts
Revert "sched/x86_64: Don't save flags on context switch"
Pull perf fixes from Ingo Molnar:
"Tooling fixes: a 'perf record' deadlock fix plus debuggability fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf top: Show backtrace when handling a SIGSEGV on --stdio mode
perf tools: Fix buildid processing
perf tools: Make fork event processing more resilient
perf tools: Avoid deadlock when map_groups are broken
Pull irq fixes from Thomas Gleixner:
"A series of small fixlets for a regression visible on OMAP devices
caused by the conversion of the OMAP interrupt chips to hierarchical
interrupt domains. Mostly one liners on the driver side plus a small
helper function in the core to avoid open coded mess in the drivers"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/crossbar: Restore set_wake functionality
irqchip/crossbar: Restore the mask on suspend behaviour
ARM: OMAP: wakeupgen: Restore the irq_set_type() mechanism
irqchip/crossbar: Restore the irq_set_type() mechanism
genirq: Introduce irq_chip_set_type_parent() helper
genirq: Don't return ENOSYS in irq_chip_retrigger_hierarchy
Pull timer fixes from Thomas Gleixner:
"Two minimalistic fixes for 4.2 regressions:
- Eric fixed a thinko in the timer_list base switching code caused by
the overhaul of the timer wheel. It can cause a cpu to see the
wrong base for a timer while we move the timer around.
- Guenter fixed a regression for IMX if booted w/o device tree, where
the timer interrupt is not initialized and therefor the machine
fails to boot"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/imx: Fix boot with non-DT systems
timer: Write timer->flags atomically
During later stages of math-emu bootup the following crash triggers:
math_emulate: 0060:c100d0a8
Kernel panic - not syncing: Math emulation needed in kernel
CPU: 0 PID: 1511 Comm: login Not tainted 4.2.0-rc7+ #1012
[...]
Call Trace:
[<c181d50d>] dump_stack+0x41/0x52
[<c181c918>] panic+0x77/0x189
[<c1003530>] ? math_error+0x140/0x140
[<c164c2d7>] math_emulate+0xba7/0xbd0
[<c100d0a8>] ? fpu__copy+0x138/0x1c0
[<c1109c3c>] ? __alloc_pages_nodemask+0x12c/0x870
[<c136ac20>] ? proc_clear_tty+0x40/0x70
[<c136ac6e>] ? session_clear_tty+0x1e/0x30
[<c1003530>] ? math_error+0x140/0x140
[<c1003575>] do_device_not_available+0x45/0x70
[<c100d0a8>] ? fpu__copy+0x138/0x1c0
[<c18258e6>] error_code+0x5a/0x60
[<c1003530>] ? math_error+0x140/0x140
[<c100d0a8>] ? fpu__copy+0x138/0x1c0
[<c100c205>] arch_dup_task_struct+0x25/0x30
[<c1048cea>] copy_process.part.51+0xea/0x1480
[<c115a8e5>] ? dput+0x175/0x200
[<c136af70>] ? no_tty+0x30/0x30
[<c1157242>] ? do_vfs_ioctl+0x322/0x540
[<c104a21a>] _do_fork+0xca/0x340
[<c1057b06>] ? SyS_rt_sigaction+0x66/0x90
[<c104a557>] SyS_clone+0x27/0x30
[<c1824a80>] sysenter_do_call+0x12/0x12
The reason is the incorrect assumption in fpu_copy(), that FNSAVE
can be executed from math-emu kernels as well.
Don't try to copy the registers, the soft state will be copied
by fork anyway, so the child task inherits the parent task's
soft math state.
With this fix applied math-emu kernels boot up fine on modern
hardware and the 'no387 nofxsr' boot options.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On a math-emu bootup the following crash occurs:
Initializing CPU#0
------------[ cut here ]------------
kernel BUG at arch/x86/kernel/traps.c:779!
invalid opcode: 0000 [#1] SMP
[...]
EIP is at do_device_not_available+0xe/0x70
[...]
Call Trace:
[<c18238e6>] error_code+0x5a/0x60
[<c1002bd0>] ? math_error+0x140/0x140
[<c100bbd9>] ? fpu__init_cpu+0x59/0xa0
[<c1012322>] cpu_init+0x202/0x330
[<c104509f>] ? __native_set_fixmap+0x1f/0x30
[<c1b56ab0>] trap_init+0x305/0x346
[<c1b548af>] start_kernel+0x1a5/0x35d
[<c1b542b4>] i386_start_kernel+0x82/0x86
The reason is that in the following commit:
b1276c48e9 ("x86/fpu: Initialize fpregs in fpu__init_cpu_generic()")
I failed to consider math-emu's limitation that it cannot execute the
FNINIT instruction in kernel mode.
The long term fix might be to allow math-emu to execute (certain) kernel
mode FPU instructions, but for now apply the safe (albeit somewhat ugly)
fix: initialize the emulation state explicitly without trapping out to
the FPU emulator.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While the idea behind get_maintainer seems highly useful it's
unfortunately way to trigger happy to grab people that once had a few
commits to files. For someone like me who does a lot of tree-wide API
work that leads to an incredible amount of Cc spam.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit c48a11c7ad ("netvm: propagate page->pfmemalloc to skb") added
checks for page->pfmemalloc to __skb_fill_page_desc():
if (page->pfmemalloc && !page->mapping)
skb->pfmemalloc = true;
It assumes page->mapping == NULL implies that page->pfmemalloc can be
trusted. However, __delete_from_page_cache() can set set page->mapping
to NULL and leave page->index value alone. Due to being in union, a
non-zero page->index will be interpreted as true page->pfmemalloc.
So the assumption is invalid if the networking code can see such a page.
And it seems it can. We have encountered this with a NFS over loopback
setup when such a page is attached to a new skbuf. There is no copying
going on in this case so the page confuses __skb_fill_page_desc which
interprets the index as pfmemalloc flag and the network stack drops
packets that have been allocated using the reserves unless they are to
be queued on sockets handling the swapping which is the case here and
that leads to hangs when the nfs client waits for a response from the
server which has been dropped and thus never arrive.
The struct page is already heavily packed so rather than finding another
hole to put it in, let's do a trick instead. We can reuse the index
again but define it to an impossible value (-1UL). This is the page
index so it should never see the value that large. Replace all direct
users of page->pfmemalloc by page_is_pfmemalloc which will hide this
nastiness from unspoiled eyes.
The information will get lost if somebody wants to use page->index
obviously but that was the case before and the original code expected
that the information should be persisted somewhere else if that is
really needed (e.g. what SLAB and SLUB do).
[akpm@linux-foundation.org: fix blooper in slub]
Fixes: c48a11c7ad ("netvm: propagate page->pfmemalloc to skb")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Debugged-by: Vlastimil Babka <vbabka@suse.com>
Debugged-by: Jiri Bohac <jbohac@suse.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org> [3.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>