Currently, if we cannot create a association to the first IP addr
that is added to DLM, the SCTP init assoc code will just retry
the same IP. This patch adds a simple failover schemes where we
will try one of the addresses that was passed into DLM.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
We should be testing and cleaing the init pending bit because later
when sctp_init_assoc is recalled it will be checking that it is not set
and set the bit.
We do not want to touch CF_CONNECT_PENDING here because we will queue
swork and process_send_sockets will then call the connect_action function.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
sctp_assoc was not getting set so later lookups failed.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
We were clearing the base con's init pending flags, but the
con for the node was the one with the pending bit set.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
Pull networking updates from David Miller:
"Highlights (1721 non-merge commits, this has to be a record of some
sort):
1) Add 'random' mode to team driver, from Jiri Pirko and Eric
Dumazet.
2) Make it so that any driver that supports configuration of multiple
MAC addresses can provide the forwarding database add and del
calls by providing a default implementation and hooking that up if
the driver doesn't have an explicit set of handlers. From Vlad
Yasevich.
3) Support GSO segmentation over tunnels and other encapsulating
devices such as VXLAN, from Pravin B Shelar.
4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton.
5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita
Dukkipati.
6) In the PHY layer, allow supporting wake-on-lan in situations where
the PHY registers have to be written for it to be configured.
Use it to support wake-on-lan in mv643xx_eth.
From Michael Stapelberg.
7) Significantly improve firewire IPV6 support, from YOSHIFUJI
Hideaki.
8) Allow multiple packets to be sent in a single transmission using
network coding in batman-adv, from Martin Hundebøll.
9) Add support for T5 cxgb4 chips, from Santosh Rastapur.
10) Generalize the VXLAN forwarding tables so that there is more
flexibility in configurating various aspects of the endpoints.
From David Stevens.
11) Support RSS and TSO in hardware over GRE tunnels in bxn2x driver,
from Dmitry Kravkov.
12) Zero copy support in nfnelink_queue, from Eric Dumazet and Pablo
Neira Ayuso.
13) Start adding networking selftests.
14) In situations of overload on the same AF_PACKET fanout socket, or
per-cpu packet receive queue, minimize drop by distributing the
load to other cpus/fanouts. From Willem de Bruijn and Eric
Dumazet.
15) Add support for new payload offset BPF instruction, from Daniel
Borkmann.
16) Convert several drivers over to mdoule_platform_driver(), from
Sachin Kamat.
17) Provide a minimal BPF JIT image disassembler userspace tool, from
Daniel Borkmann.
18) Rewrite F-RTO implementation in TCP to match the final
specification of it in RFC4138 and RFC5682. From Yuchung Cheng.
19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear
you like netlink, so I implemented netlink dumping of netlink
sockets.") From Andrey Vagin.
20) Remove ugly passing of rtnetlink attributes into rtnl_doit
functions, from Thomas Graf.
21) Allow userspace to be able to see if a configuration change occurs
in the middle of an address or device list dump, from Nicolas
Dichtel.
22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes
Frederic Sowa.
23) Increase accuracy of packet length used by packet scheduler, from
Jason Wang.
24) Beginning set of changes to make ipv4/ipv6 fragment handling more
scalable and less susceptible to overload and locking contention,
from Jesper Dangaard Brouer.
25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_*()
instead. From Hong Zhiguo.
26) Optimize route usage in IPVS by avoiding reference counting where
possible, from Julian Anastasov.
27) Convert IPVS schedulers to RCU, also from Julian Anastasov.
28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger
Eitzenberger.
29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG,
nfnetlink_log, and nfnetlink_queue. From Gao feng.
30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa.
31) Support several new r8169 chips, from Hayes Wang.
32) Support tokenized interface identifiers in ipv6, from Daniel
Borkmann.
33) Use usbnet_link_change() helper in USB net driver, from Ming Lei.
34) Add 802.1ad vlan offload support, from Patrick McHardy.
35) Support mmap() based netlink communication, also from Patrick
McHardy.
36) Support HW timestamping in mlx4 driver, from Amir Vadai.
37) Rationalize AF_PACKET packet timestamping when transmitting, from
Willem de Bruijn and Daniel Borkmann.
38) Bring parity to what's provided by /proc/net/packet socket dumping
and the info provided by netlink socket dumping of AF_PACKET
sockets. From Nicolas Dichtel.
39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin
Poirier"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
filter: fix va_list build error
af_unix: fix a fatal race with bit fields
bnx2x: Prevent memory leak when cnic is absent
bnx2x: correct reading of speed capabilities
net: sctp: attribute printl with __printf for gcc fmt checks
netlink: kconfig: move mmap i/o into netlink kconfig
netpoll: convert mutex into a semaphore
netlink: Fix skb ref counting.
net_sched: act_ipt forward compat with xtables
mlx4_en: fix a build error on 32bit arches
Revert "bnx2x: allow nvram test to run when device is down"
bridge: avoid OOPS if root port not found
drivers: net: cpsw: fix kernel warn on cpsw irq enable
sh_eth: use random MAC address if no valid one supplied
3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA)
tg3: fix to append hardware time stamping flags
unix/stream: fix peeking with an offset larger than data in queue
unix/dgram: fix peeking with an offset larger than data in queue
unix/dgram: peek beyond 0-sized skbs
openvswitch: Remove unneeded ovs_netdev_get_ifindex()
...
This patch introduces an UAPI header for the SCTP protocol,
so that we can facilitate the maintenance and development of
user land applications or libraries, in particular in terms
of header synchronization.
To not break compatibility, some fragments from lksctp-tools'
netinet/sctp.h have been carefully included, while taking care
that neither kernel nor user land breaks, so both compile fine
with this change (for lksctp-tools I tested with the old
netinet/sctp.h header and with a newly adapted one that includes
the uapi sctp header). lksctp-tools smoke test run through
successfully as well in both cases.
Suggested-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the kernel clears flocks/plocks during close, it calls posix
unlock when there are flocks but no posix locks. Without this
patch, that unnecessary posix unlock is passed to userland
(dlm_controld), across the cluster, and back to the kernel.
This can create a lot of plock activity, even when no posix
locks had been used.
This patch copies the nfs approach, and skips the full posix
unlock if there is no plock found during the vfs unlock phase.
Signed-off-by: David Teigland <teigland@redhat.com>
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert to the much saner new idr interface. Error return values from
recover_idr_add() mix -1 and -errno. The conversion doesn't change
that but it looks iffy.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
idr_destroy() can destroy idr by itself and idr_remove_all() is being
deprecated.
The conversion isn't completely trivial for recover_idr_clear() as it's
the only place in kernel which makes legitimate use of idr_remove_all()
w/o idr_destroy(). Replace it with idr_remove() call inside
idr_for_each_entry() loop. It goes on top so that it matches the
operation order in recover_idr_del().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert recover_idr_clear() to use idr_for_each_entry() instead of
idr_for_each(). It's somewhat less efficient this way but it shouldn't
matter in an error path. This is to help with deprecation of
idr_remove_all().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.
The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.
Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.
PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
According to SUSv3:
[EACCES] Permission denied. An attempt was made to access a file in a way
forbidden by its file access permissions.
[EPERM] Operation not permitted. An attempt was made to perform an operation
limited to processes with appropriate privileges or to the owner of a file
or other resource.
So -EPERM should be returned if capability checks fails.
Strictly speaking this is an API change since the error code user sees is
altered.
Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This includes a single patch to avoid excessive and
unnecessary scanning of rsbs to free.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJREX6ZAAoJEDgbc8f8gGmqDJ4QAKFuq+vwBb+21lNVDk1egBTf
nDcpuntzZfnqgg+VKC5LJm0bBAvgexH2GrQMIPDFPPMgAOFjTxXDvm93wVGeAK6d
JpkiZxW1a0lyjR+7NMvDFnoIdVJvKw+GezNITiVo/lMghl51NWOQTH8QmSBKdEdJ
cxwzm/ZV9gBS5bv5hfeq2hCDVD1q5jWohMjYHb+xyx2E/c4z8L8enEMR1yBBrB/4
qw3DgGRMc0eoWzZrXKt1F6MYtM8QnU/H+WxxmLUYc4SMbClIonivAzzOa0PHO5nV
YvewWwpo2VVT50EIxARXi0XQzJ0aYbTwo+E7KoG+MzK5sCmLJaP8inH5UnlltvEu
OyhZUkh7qPNjhUQokP3FlNWidZkZNM+gJW6hmZwXtSGfGH6pe620QBrbghcV3nCh
QXyENcKUcyDNJQSBQFKV/s4ql4RI+iDS3PfovyOdNqkEDWurWKx0AYcrn2dSNYt1
wjFPH94P3NOwz4nIabpYiICrEPXLdXw+CCvHz56DgBL7EFzrWAzC97v5arYpouhZ
NmsnzV8+PGHDWA+NM73fKGokW259pvk/zOPtM13zoyXP18X6adGKUqMpZA7jZge9
RHzO/+A+3ScMXaXdb3c2P7GYH/iSSJXBmcTg8BceOYxICTq1wh4HEzvm39vEwMop
3BMloPmzKxzfis69swIu
=WtiV
-----END PGP SIGNATURE-----
Merge tag 'dlm-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm update from David Teigland:
"This includes a single patch to avoid excessive and unnecessary
scanning of rsbs to free."
* tag 'dlm-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: avoid scanning unchanged toss lists
Return EINVAL from write if the size is larger than
allowed. Do this before allocating kernel memory for
the bogus size, which could lead to OOM.
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Tested-by: Jana Saout <jana@saout.de>
Signed-off-by: David Teigland <teigland@redhat.com>
Keep track of whether a toss list contains any
shrinkable rsbs. If not, dlm_scand can avoid
scanning the list for rsbs to shrink. Unnecessary
scanning can otherwise waste a lot of time because
the toss lists can contain a large number of rsbs
that are non-shrinkable (directory records).
Signed-off-by: David Teigland <teigland@redhat.com>
When a node is removed that held a PW/EX lock, the
existing master node should invalidate the lvb on the
resource due to the purged lock.
Previously, the existing master node was invalidating
the lvb if it found only NL/CR locks on the resource
during recovery for the removed node. This could lead
to cases where it invalidated the lvb and shouldn't
have, or cases where it should have invalidated and
didn't.
When recovery selects a *new* master node for a
resource, and that new master finds only NL/CR locks
on the resource after lock recovery, it should
invalidate the lvb. This case was handled correctly
(but was incorrectly applied to the existing master
case also.)
When a process exits while holding a PW/EX lock,
the lvb on the resource should be invalidated.
This was not happening.
The lvb contents and VALNOTVALID flag should be
recovered before granting locks in recovery so that
the recovered lvb state is provided in the callback.
The lvb was being recovered after the lock was granted.
Signed-off-by: David Teigland <teigland@redhat.com>
This config item has not carried much meaning for a while now and is
almost always enabled by default. As agreed during the Linux kernel
summit, remove it.
CC: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Teigland <teigland@redhat.com>
The variable users is initialized but never used
otherwise, so remove the unused variable.
dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David Teigland <teigland@redhat.com>
Pull networking changes from David Miller:
1) GRE now works over ipv6, from Dmitry Kozlov.
2) Make SCTP more network namespace aware, from Eric Biederman.
3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.
4) Make openvswitch network namespace aware, from Pravin B Shelar.
5) IPV6 NAT implementation, from Patrick McHardy.
6) Server side support for TCP Fast Open, from Jerry Chu and others.
7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
Borkmann.
8) Increate the loopback default MTU to 64K, from Eric Dumazet.
9) Use a per-task rather than per-socket page fragment allocator for
outgoing networking traffic. This benefits processes that have very
many mostly idle sockets, which is quite common.
From Eric Dumazet.
10) Use up to 32K for page fragment allocations, with fallbacks to
smaller sizes when higher order page allocations fail. Benefits are
a) less segments for driver to process b) less calls to page
allocator c) less waste of space.
From Eric Dumazet.
11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.
12) VXLAN device driver, one way to handle VLAN issues such as the
limitation of 4096 VLAN IDs yet still have some level of isolation.
From Stephen Hemminger.
13) As usual there is a large boatload of driver changes, with the scale
perhaps tilted towards the wireless side this time around.
Fix up various fairly trivial conflicts, mostly caused by the user
namespace changes.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
hyperv: Add buffer for extended info after the RNDIS response message.
hyperv: Report actual status in receive completion packet
hyperv: Remove extra allocated space for recv_pkt_list elements
hyperv: Fix page buffer handling in rndis_filter_send_request()
hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
hyperv: Fix the max_xfer_size in RNDIS initialization
vxlan: put UDP socket in correct namespace
vxlan: Depend on CONFIG_INET
sfc: Fix the reported priorities of different filter types
sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
sfc: Fix loopback self-test with separate_tx_channels=1
sfc: Fix MCDI structure field lookup
sfc: Add parentheses around use of bitfield macro arguments
sfc: Fix null function pointer in efx_sriov_channel_type
vxlan: virtual extensible lan
igmp: export symbol ip_mc_leave_group
netlink: add attributes to fdb interface
tg3: unconditionally select HWMON support when tg3 is enabled.
Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
gre: fix sparse warning
...
It is a frequent mistake to confuse the netlink port identifier with a
process identifier. Try to reduce this confusion by renaming fields
that hold port identifiers portid instead of pid.
I have carefully avoided changing the structures exported to
userspace to avoid changing the userspace API.
I have successfully built an allyesconfig kernel with this change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
device_write only checks whether the request size is big enough, but it doesn't
check if the size is too big.
At that point, it also tries to allocate as much memory as the user has requested
even if it's too much. This can lead to OOM killer kicking in, or memory corruption
if (count + 1) overflows.
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Since add_sock() always returns a success code - 0, its return
value type should be changed from integer to void.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Once the tcp_create_listen_sock() is returned successfully, we
will invoke add_sock() immediately. In add_sock(), the 'con'
variable is assigned to 'sk_user_data', meanwhile, the 'sock' is
also set to 'con->sock'. So it's unnecessary to do the same thing
in tcp_create_listen_sock().
Signed-off-by: Xue Ying <ying.xue@windriver.com>
Signed-off-by: David Teigland <teigland@redhat.com>
The in_recovery rw_semaphore has always been acquired and
released by different threads by design. To work around
the "BUG: bad unlock balance detected!" messages, adjust
things so the dlm_recoverd thread always does both down_write
and up_write.
Signed-off-by: David Teigland <teigland@redhat.com>
A deadlock sometimes occurs between dlm_controld closing
a lowcomms connection through configfs and dlm_send looking
up the address for a new connection in configfs.
dlm_controld does a configfs rmdir which calls
dlm_lowcomms_close which waits for dlm_send to
cancel work on the workqueues.
The dlm_send workqueue thread has called
tcp_connect_to_sock which calls dlm_nodeid_to_addr
which does a configfs lookup and blocks on a lock
held by dlm_controld in the rmdir path.
The solution here is to save the node addresses within
the lowcomms code so that the lowcomms workqueue does
not need to step through configfs to get a node address.
dlm_controld:
wait_for_completion+0x1d/0x20
__cancel_work_timer+0x1b3/0x1e0
cancel_work_sync+0x10/0x20
dlm_lowcomms_close+0x4c/0xb0 [dlm]
drop_comm+0x22/0x60 [dlm]
client_drop_item+0x26/0x50 [configfs]
configfs_rmdir+0x180/0x230 [configfs]
vfs_rmdir+0xbd/0xf0
do_rmdir+0x103/0x120
sys_rmdir+0x16/0x20
dlm_send:
mutex_lock+0x2b/0x50
get_comm+0x34/0x140 [dlm]
dlm_nodeid_to_addr+0x18/0xd0 [dlm]
tcp_connect_to_sock+0xf4/0x2d0 [dlm]
process_send_sockets+0x1d2/0x260 [dlm]
worker_thread+0x170/0x2a0
Signed-off-by: David Teigland <teigland@redhat.com>
I don't know exactly how, but in some cases, a dir
record is not removed, or a new one is created when
it shouldn't be. The result is that the dir node
lookup returns a master node where the rsb does not
exist. In this case, The master node will repeatedly
return -EBADR for requests, and the lock requests will
be stuck.
Until all possible ways for this to happen can be
eliminated, a simple and effective way to recover from
this situation is for the supposed master node to send
a standard remove message to the dir node when it
receives a request for a resource it has no rsb for.
Signed-off-by: David Teigland <teigland@redhat.com>
The process of rebuilding locks on a new master during
recovery could re-order the locks on the convert queue,
creating an "in place" conversion deadlock that would
not be resolved. Fix this by not considering queue
order when granting conversions after recovery.
Signed-off-by: David Teigland <teigland@redhat.com>
It was possible for a remove message on an old
rsb to be sent after a lookup message on a new
rsb, where the rsbs were for the same resource
name. This could lead to a missing directory
entry for the new rsb.
It is fixed by keeping a copy of the resource
name being removed until after the remove has
been sent. A lookup checks if this in-progress
remove matches the name it is looking up.
Signed-off-by: David Teigland <teigland@redhat.com>
When a large number of resources are being recovered,
a linear search of the recover_list takes a long time.
Use an idr in place of a list.
Signed-off-by: David Teigland <teigland@redhat.com>
Remove the dir hash table (dirtbl), and use
the rsb hash table (rsbtbl) as the resource
directory. It has always been an unnecessary
duplication of information.
This improves efficiency by using a single rsbtbl
lookup in many cases where both rsbtbl and dirtbl
lookups were needed previously.
This eliminates the need to handle cases of rsbtbl
and dirtbl being out of sync.
In many cases there will be memory savings because
the dir hash table no longer exists.
Signed-off-by: David Teigland <teigland@redhat.com>
We aren't allowed to pass NULL pointers to kmem_cache_destroy() so if
both allocations fail, it leads to a NULL dereference.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Teigland <teigland@redhat.com>
The "nodir" mode (statically assign master nodes instead
of using the resource directory) has always been highly
experimental, and never seriously used. This commit
fixes a number of problems, making nodir much more usable.
- Major change to recovery: recover all locks and restart
all in-progress operations after recovery. In some
cases it's not possible to know which in-progess locks
to recover, so recover all. (Most require recovery
in nodir mode anyway since rehashing changes most
master nodes.)
- Change the way nodir mode is enabled, from a command
line mount arg passed through gfs2, into a sysfs
file managed by dlm_controld, consistent with the
other config settings.
- Allow recovering MSTCPY locks on an rsb that has not
yet been turned into a master copy.
- Ignore RCOM_LOCK and RCOM_LOCK_REPLY recovery messages
from a previous, aborted recovery cycle. Base this
on the local recovery status not being in the state
where any nodes should be sending LOCK messages for the
current recovery cycle.
- Hold rsb lock around dlm_purge_mstcpy_locks() because it
may run concurrently with dlm_recover_master_copy().
- Maintain highbast on process-copy lkb's (in addition to
the master as is usual), because the lkb can switch
back and forth between being a master and being a
process copy as the master node changes in recovery.
- When recovering MSTCPY locks, flag rsb's that have
non-empty convert or waiting queues for granting
at the end of recovery. (Rename flag from LOCKS_PURGED
to RECOVER_GRANT and similar for the recovery function,
because it's not only resources with purged locks
that need grant a grant attempt.)
- Replace a couple of unnecessary assertion panics with
error messages.
Signed-off-by: David Teigland <teigland@redhat.com>
Change some existing error/debug messages to
collect more useful information, and add
some new error/debug messages to address
recently found problems.
Signed-off-by: David Teigland <teigland@redhat.com>
If the rsb is found in the "keep" tree, but is
not the right type (i.e. not MASTER), we can
return immediately with the result. There's
no point in going on to search the "toss" list
as if we hadn't found it.
Signed-off-by: David Teigland <teigland@redhat.com>
Unify the checking for both types of ignored
rcom messages, and replace the two log_debug
statements with a single, rate limited debug
message.
Signed-off-by: David Teigland <teigland@redhat.com>
An outstanding remote operation (an lkb on the "waiter"
list) could sometimes miss being resent during recovery.
The decision was based on the lkb_nodeid field, which
could have changed during an earlier aborted recovery,
so it no longer represents the actual remote destination.
The lkb_wait_nodeid is always the actual remote node,
so it is the best value to use.
Signed-off-by: David Teigland <teigland@redhat.com>
During lowcomms shutdown, a new connection could possibly
be created, and attempt to use a workqueue that's been
destroyed. Similarly, during startup, a new connection
could attempt to use a workqueue that's not been set up
yet. Add a global variable to indicate when new connections
are allowed.
Based on patch by: Christine Caulfield <ccaulfie@redhat.com>
Reported-by: dann frazier <dann.frazier@canonical.com>
Reviewed-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This includes one short patch fixing the behavior of
the QUECVT flag, which the gfs2 folks are waiting on.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPlYXZAAoJEDgbc8f8gGmqpzYP/RFkCn8mC5y5cM8lWBk2JQAJ
u7khyqowm3TWxjIpX85n7Uxq1vEX4RxFiRzCeiZj3ZoWE3PEQim8Tqrw8SFs8lcT
y7oYL6TBkgCbM1ROuKDqXRiw8oRAfRud3cqtRvQzxuds3AoaoyYvE6N+to2y9XlR
5DuUBJEtrpKOEdW1ZeXeUmCnvDwrUyEFuIlACoyochzbk6ug1EF926dgSaViE4ZG
OFcGMy8ELNqVYibVcJof2ZfztTvrMcXPIpsJrkK5tIW6w6q+2+eN4Xc2/xMZ4OYc
5AHHXxrqbK1ZABLrqsK/lUQi0Z241kAnqIi33i2nl3mhWSDF3K5CNXmrF9rvGsN7
wEqsfdGOnwFQucF1VU95neo+jYMnom9VGodpvSop7Xy5r+i59MPcfMDfz/I1KqX7
vBDuM5rwisYNfOb6wsfFNcBhkf1ktgo2h2iH5UdIaWfHApF1Lnls7D2j/o7r2uxF
tRd4sPhRt2eIn68XRggbWOVxMfdUKtaW50ZhKzW9osMItYX748O8XfQdk0sQUbD9
ZXbFEfbfsfRgMKhMSyNFcGDh6ePsT/cmZL/zR5VKVEHuprL3hEDPhCui5GT0Sm1G
9sXpLu9p51r0d4OIJpScOFMv8aD64w/mwLJ3r5nrGZz2APK9SwWJOqX82fyqivQc
uvO42yNGkwSGnBjXKiM6
=KDNZ
-----END PGP SIGNATURE-----
Merge tag 'dlm-fixes-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm fixes from David Teigland:
"This includes one short patch fixing the behavior of the QUECVT flag,
which the gfs2 folks are waiting on."
* tag 'dlm-fixes-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: fix QUECVT when convert queue is empty
The QUECVT flag should not prevent conversions from
being granted immediately when the convert queue is
empty.
Signed-off-by: David Teigland <teigland@redhat.com>
Many users of debugfs copy the implementation of default_open() when
they want to support a custom read/write function op. This leads to a
proliferation of the default_open() implementation across the entire
tree.
Now that the common implementation has been consolidated into libfs we
can replace all the users of this function with simple_open().
This replacement was done with the following semantic patch:
<smpl>
@ open @
identifier open_f != simple_open;
identifier i, f;
@@
-int open_f(struct inode *i, struct file *f)
-{
(
-if (i->i_private)
-f->private_data = i->i_private;
|
-f->private_data = i->i_private;
)
-return 0;
-}
@ has_open depends on open @
identifier fops;
identifier open.open_f;
@@
struct file_operations fops = {
...
-.open = open_f,
+.open = simple_open,
...
};
</smpl>
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This set includes one trivial fix, and one simple recovery
speed up. Directory recovery can use the standard hash table
to find resources rather than always searching the linear
recovery list.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPahBcAAoJEDgbc8f8gGmqeHEP/i288yZV8NVbIJG7XpX9JjTY
4n4R1CI/qTMDn74GXDkk/OolHc8XTSQwbp02oFlJbPzj71lsWBUWijTAnwxiLIRz
OHQg7eZ2aYL0YmaxAlvM2/6xLNOINmLW/DVwwH4QnpnSB4ymoCHBzyXxrNxgvgRv
KWKUUXj7SDaUmbcK0TFZ39VprTmpw3L+mXIm+Y6kCCS2m4GfISp3Zij4OnxztA/c
brex0R97EoZwrQOvPSRbVA5IaK6BjwfNScXAKsYCOSLsd+tvelD+UgYBdVHBTOmG
godQ5pg8C7SpUB9NQqnLc8r78xpIUcOHQbWRqtwNQ2/6uPI/mWFj+lhpcHRmmzPk
TczdDZVg+pIl9U+SMqiG689KgvnUTciPte0sYqksEbk3NqUMJOWOB7Cv79ZYquaV
Pdmg788Essq7/5BmgeSRlOvS08RvdVfHXqYGOA6/tJ3f0b15M1YuSLjJdwYVWJkS
gVmo4raN44Yh99R/+eNqeI8dvoVfd1pNDAD9VYXk4KdIv3AtKfRZi8XvWZ0o5uQI
EdXTIhiA78ogjxG92cnnzj3+CAIpK4Iv1s53Y0KZgJ7gyExvVHyGp7zl1J7hlFLP
jLuORsL+xMKTGbSWom796QuVn3jL/CGj/OKbnd1D98S0uRuSS6wiy/6ucBFaKmt0
HvT7AVcX2Gh6t/qdTJ9h
=ByvY
-----END PGP SIGNATURE-----
Merge tag 'dlm-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates for 3.4 from David Teigland:
"This set includes one trivial fix, and one simple recovery speed up.
Directory recovery can use the standard hash table to find resources
rather than always searching the linear recovery list."
* tag 'dlm-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: last element of dlm_local_addr[] never used
dlm: fix slow rsb search in dir recovery
The last element of dlm_local_addr[DLM_MAX_ADDR_COUNT]
was not used because the loop ended at COUNT - 1.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Teigland <teigland@redhat.com>
avoids allocating a fd that a) propagates to every kernel thread and
usermodehelper b) is not properly released.
References: http://article.gmane.org/gmane.linux.network.drbd/22529
Signed-off-by: Benjamin Poirier <bpoirier@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function used to find an rsb during directory
recovery was searching the single linear list of
rsb's. This wasted a lot of time compared to
using the standard hash table to find the rsb.
Signed-off-by: David Teigland <teigland@redhat.com>
These new callbacks notify the dlm user about lock recovery.
GFS2, and possibly others, need to be aware of when the dlm
will be doing lock recovery for a failed lockspace member.
In the past, this coordination has been done between dlm and
file system daemons in userspace, which then direct their
kernel counterparts. These callbacks allow the same
coordination directly, and more simply.
Signed-off-by: David Teigland <teigland@redhat.com>
Slot numbers are assigned to nodes when they join the lockspace.
The slot number chosen is the minimum unused value starting at 1.
Once a node is assigned a slot, that slot number will not change
while the node remains a lockspace member. If the node leaves
and rejoins it can be assigned a new slot number.
A new generation number is also added to a lockspace. It is
set and incremented during each recovery along with the slot
collection/assignment.
The slot numbers will be passed to gfs2 which will use them as
journal id's.
Signed-off-by: David Teigland <teigland@redhat.com>
Put all the calls to recovery barriers in the same function
to clarify where they each happen. Should not change any behavior.
Also modify some recovery debug lines to make them consistent.
Signed-off-by: David Teigland <teigland@redhat.com>
C assignment can handle struct in6_addr copying.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the linked lists to rb_tree's in the rsb
hash table to speed up searches. Slow rsb searches
were having a large impact on gfs2 performance due
to the large number of dlm locks gfs2 uses.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
* 'for-3.1' of git://linux-nfs.org/~bfields/linux:
nfsd: don't break lease on CLAIM_DELEGATE_CUR
locks: rename lock-manager ops
nfsd4: update nfsv4.1 implementation notes
nfsd: turn on reply cache for NFSv4
nfsd4: call nfsd4_release_compoundargs from pc_release
nfsd41: Deny new lock before RECLAIM_COMPLETE done
fs: locks: remove init_once
nfsd41: check the size of request
nfsd41: error out when client sets maxreq_sz or maxresp_sz too small
nfsd4: fix file leak on open_downgrade
nfsd4: remember to put RW access on stateid destruction
NFSD: Added TEST_STATEID operation
NFSD: added FREE_STATEID operation
svcrpc: fix list-corrupting race on nfsd shutdown
rpc: allow autoloading of gss mechanisms
svcauth_unix.c: quiet sparse noise
svcsock.c: include sunrpc.h to quiet sparse noise
nfsd: Remove deprecated nfsctl system call and related code.
NFSD: allow OP_DESTROY_CLIENTID to be only op in COMPOUND
Fix up trivial conflicts in Documentation/feature-removal-schedule.txt
Both the filesystem and the lock manager can associate operations with a
lock. Confusingly, one of them (fl_release_private) actually has the
same name in both operation structures.
It would save some confusion to give the lock-manager ops different
names.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Allow multiple workqueue items (locks with callbacks) to be
processed concurrently. There should be no reason not to
take advantage of this workqueue feature.
Signed-off-by: David Teigland <teigland@redhat.com>
Instead of creating our own kthread (dlm_astd) to deliver
callbacks for all lockspaces, use a per-lockspace workqueue
to deliver the callbacks. This eliminates complications and
slowdowns from many lockspaces sharing the same thread.
Signed-off-by: David Teigland <teigland@redhat.com>
By pre-allocating rsb structs before searching the hash
table, they can be inserted immediately. This avoids
always having to repeat the search when adding the struct
to hash list.
This also adds space to the rsb struct for a max resource
name, so an rsb allocation can be used by any request.
The constant size also allows us to finally use a slab
for the rsb structs.
Signed-off-by: David Teigland <teigland@redhat.com>
This is simpler and quicker than the hash table, and
avoids needing to search the hash list for every new
lkid to check if it's used.
Signed-off-by: David Teigland <teigland@redhat.com>
In fs/dlm/lock.c in the dlm_scan_waiters() function there are 3 small
issues:
1) There's no need to test the return value of the allocation and do a
memset if is succeedes. Just use kzalloc() to obtain zeroed memory.
2) Since kfree() handles NULL pointers gracefully, the test of
'warned' against NULL before the kfree() after the loop is completely
pointless. Remove it.
3) The arguments to kmalloc() (now kzalloc()) were swapped. Thanks to
Dr. David Alan Gilbert for pointing this out.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: David Teigland <teigland@redhat.com>
When the dlm fails to make a network connection to another
node, include the address of the node in the error message.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Allocate dlm hash tables in the vmalloc area to allow a greater
maximum size without restructuring of the hash table code.
Signed-off-by: Bryn M. Reeves <bmr@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Display all addresses the dlm is using for the local node
from the configfs file config/dlm/<cluster>/comms/<comm>/addr_list
Also make the addr file write only.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
* 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6:
gfs2: Drop __TIME__ usage
isdn/diva: Drop __TIME__ usage
atm: Drop __TIME__ usage
dlm: Drop __TIME__ usage
wan/pc300: Drop __TIME__ usage
parport: Drop __TIME__ usage
hdlcdrv: Drop __TIME__ usage
baycom: Drop __TIME__ usage
pmcraid: Drop __DATE__ usage
edac: Drop __DATE__ usage
rio: Drop __DATE__ usage
scsi/wd33c93: Drop __TIME__ usage
scsi/in2000: Drop __TIME__ usage
aacraid: Drop __TIME__ usage
media/cx231xx: Drop __TIME__ usage
media/radio-maxiradio: Drop __TIME__ usage
nozomi: Drop __TIME__ usage
cyclades: Drop __TIME__ usage
The kernel already prints its build timestamp during boot, no need to
repeat it in random drivers and produce different object files each
time.
Cc: Christine Caulfield <ccaulfie@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Cc: cluster-devel@redhat.com
Signed-off-by: Michal Marek <mmarek@suse.cz>
Allow processes blocked on plock requests to be interrupted
when they are killed. This leaves the problem of cleaning
up the lock state in userspace. This has three parts:
1. Add a flag to unlock operations sent to userspace
indicating the file is being closed. Userspace will
then look for and clear any waiting plock operations that
were abandoned by an interrupted process.
2. Queue an unlock-close operation (like in 1) to clean up
userspace from an interrupted plock request. This is needed
because the vfs will not send a cleanup-unlock if it sees no
locks on the file, which it won't if the interrupted operation
was the only one.
3. Do not use replies from userspace for unlock-close operations
because they are unnecessary (they are just cleaning up for the
process which did not make an unlock call). This also simplifies
the new unlock-close generated from point 2.
Signed-off-by: David Teigland <teigland@redhat.com>
kmalloc a stub message struct during recovery instead of sharing the
struct in the lockspace. This leaves the lockspace stub_ms only for
faking downconvert replies, where it is never modified and sharing
is not a problem.
Also improve the debug messages in the same recovery function.
Signed-off-by: David Teigland <teigland@redhat.com>
Add an option (disabled by default) to print a warning message
when a lock has been waiting a configurable amount of time for
a reply message from another node. This is mainly for debugging.
Signed-off-by: David Teigland <teigland@redhat.com>
recalc_sigpending() is called within sigprocmask(), so there is no
need call it again after sigprocmask() has returned.
Signed-off-by: Matt Fleming <matt.fleming@linux.intel.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Make all three hash tables a consistent size of 1024
rather than 1024, 512, 256. All three tables, for
resources, locks, and lock dir entries, will generally
be filled to the same order of magnitude.
Signed-off-by: David Teigland <teigland@redhat.com>
Change how callbacks are recorded for locks. Previously, information
about multiple callbacks was combined into a couple of variables that
indicated what the end result should be. In some situations, we
could not tell from this combined state what the exact sequence of
callbacks were, and would end up either delivering the callbacks in
the wrong order, or suppress redundant callbacks incorrectly. This
new approach records all the data for each callback, leaving no
uncertainty about what needs to be delivered.
Signed-off-by: David Teigland <teigland@redhat.com>
The recent commit to use cmwq for send and recv threads
dcce240ead introduced problems,
apparently due to multiple workqueue threads. Single threads
make the problems go away, so return to that until we fully
understand the concurrency issues with multiple threads.
Signed-off-by: David Teigland <teigland@redhat.com>
This patch fixes the following kconfig error after changing
CONFIGFS_FS -> select SYSFS:
fs/sysfs/Kconfig:1:error: recursive dependency detected!
fs/sysfs/Kconfig:1: symbol SYSFS is selected by CONFIGFS_FS
fs/configfs/Kconfig:1: symbol CONFIGFS_FS is selected by DLM
fs/dlm/Kconfig:1: symbol DLM depends on SYSFS
Signed-off-by: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: James Bottomley <James.Bottomley@suse.de>
The create_workqueue() returns NULL if failed rather than ERR_PTR().
Fix error checking and remove unnecessary variable 'error'.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David Teigland <teigland@redhat.com>
Calling cond_resched() after every send can unnecessarily
degrade performance. Go back to an old method of scheduling
after 25 messages.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
So far as I can tell, there is no reason to use a single-threaded
send workqueue for dlm, since it may need to send to several sockets
concurrently. Both workqueues are set to WQ_MEM_RECLAIM to avoid
any possible deadlocks, WQ_HIGHPRI since locking traffic is highly
latency sensitive (and to avoid a priority inversion wrt GFS2's
glock_workqueue) and WQ_FREEZABLE just in case someone needs to do
that (even though with current cluster infrastructure, it doesn't
make sense as the node will most likely land up ejected from the
cluster) in the future.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David Teigland <teigland@redhat.com>
In the normal regime where an application uses non-blocking I/O
writes on a socket, they will handle -EAGAIN and use poll() to
wait for send space.
They don't actually sleep on the socket I/O write.
But kernel level RPC layers that do socket I/O operations directly
and key off of -EAGAIN on the write() to "try again later" don't
use poll(), they instead have their own sleeping mechanism and
rely upon ->sk_write_space() to trigger the wakeup.
So they do effectively sleep on the write(), but this mechanism
alone does not let the socket layers know what's going on.
Therefore they must emulate what would have happened, otherwise
TCP cannot possibly see that the connection is application window
size limited.
Handle this, therefore, like SUNRPC by setting SOCK_NOSPACE and
bumping the ->sk_write_count as needed when we hit the send buffer
limits.
This should make TCP send buffer size auto-tuning and the
->sk_write_space() callback invocations actually happen.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: David Teigland <teigland@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
dlm: Fix dlm lock status block comment in dlm.h
dlm: Don't send callback to node making lock request when "try 1cb" fails
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
When converting a lock, an lkb is in the granted state and also being used
to request a new state. In the case that the conversion was a "try 1cb"
type which has failed, and if the new state was incompatible with the old
state, a callback was being generated to the requesting node. This is
incorrect as callbacks should only be sent to all the other nodes holding
blocking locks. The requesting node should receive the normal (failed)
response to its "try 1cb" conversion request only.
This was discovered while debugging a performance problem on GFS2, however
this fix also speeds up GFS as well. In the GFS2 case the performance gain
is over 10x for cases of write activity to an inode whose glock is cached
on another, idle (wrt that glock) node.
(comment added, dct)
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Tested-by: Abhijith Das <adas@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
hlist_for_each_entry binds its first argument to a non-null value, and thus
any null test on the value of that argument is superfluous.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
iterator I;
expression x,E,E1,E2;
statement S,S1,S2;
@@
I(x,...) { <...
- (x != NULL) &&
E
...> }
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: David Teigland <teigland@redhat.com>
Commit 7fe2b3190b fixed possible
misordering of completion asts (casts) and blocking asts (basts)
for kernel locks. This patch does the same for locks taken by
user space applications.
Signed-off-by: David Teigland <teigland@redhat.com>
Smatch complains because "lkb" is never NULL. Looking at it, the original
code actually adds the new element to the end of the list fine, so we can
just get rid of the if condition. This code is four years old and no one
has complained so it must work.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (56 commits)
doc: fix typo in comment explaining rb_tree usage
Remove fs/ntfs/ChangeLog
doc: fix console doc typo
doc: cpuset: Update the cpuset flag file
Fix of spelling in arch/sparc/kernel/leon_kernel.c no longer needed
Remove drivers/parport/ChangeLog
Remove drivers/char/ChangeLog
doc: typo - Table 1-2 should refer to "status", not "statm"
tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments
No need to patch AMD-provided drivers/gpu/drm/radeon/atombios.h
devres/irq: Fix devm_irq_match comment
Remove reference to kthread_create_on_cpu
tree-wide: Assorted spelling fixes
tree-wide: fix 'lenght' typo in comments and code
drm/kms: fix spelling in error message
doc: capitalization and other minor fixes in pnp doc
devres: typo fix s/dev/devm/
Remove redundant trailing semicolons from macros
fix typo "definetly" -> "definitely" in comment
tree-wide: s/widht/width/g typo in comments
...
Fix trivial conflict in Documentation/laptops/00-INDEX
Constify struct sysfs_ops.
This is part of the ops structure constification
effort started by Arjan van de Ven et al.
Benefits of this constification:
* prevents modification of data that is shared
(referenced) by many other structure instances
at runtime
* detects/prevents accidental (but not intentional)
modification attempts on archs that enforce
read-only kernel data at runtime
* potentially better optimized code as the compiler
can assume that the const data cannot be changed
* the compiler/linker move const data into .rodata
and therefore exclude them from false sharing
Signed-off-by: Emese Revfy <re.emese@gmail.com>
Acked-by: David Teigland <teigland@redhat.com>
Acked-by: Matt Domsch <Matt_Domsch@dell.com>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The bast mode that appears in the debugfs output should be
useful on both master and process nodes. lkb_highbast is
currently printed, and is only useful on the master node.
lkb_bastmode is only useful on the process node. This
patch sets lkb_bastmode on the master node as well, and
uses that value in the debugfs print.
Signed-off-by: David Teigland <teigland@redhat.com>
Although it is possible to get this information from the path,
its much easier to provide the lockspace as a seperate env
variable.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
When the lock master processes a successful operation (request,
convert, cancel, or unlock), it will process the effects of the
change before sending the reply for the operation. The "effects"
of the operation are:
- blocking callbacks (basts) for any newly granted locks
- waiting or converting locks that can now be granted
The cast is queued on the local node when the reply from the lock
master is received. This means that a lock holder can receive a
bast for a lock mode that is doesn't yet know has been granted.
Signed-off-by: David Teigland <teigland@redhat.com>
When both blocking and completion callbacks are queued for lock,
the dlm would always deliver the completion callback (cast) first.
In some cases the blocking callback (bast) is queued before the
cast, though, and should be delivered first. This patch keeps
track of the order in which they were queued and delivers them
in that order.
This patch also keeps track of the granted mode in the last cast
and eliminates the following bast if the bast mode is compatible
with the preceding cast mode. This happens when a remotely mastered
lock is demoted, e.g. EX->NL, in which case the local node queues
a cast immediately after sending the demote message. In this way
a cast can be queued for a mode, e.g. NL, that makes an in-transit
bast extraneous.
Signed-off-by: David Teigland <teigland@redhat.com>
Some comments misspell "truly"; this fixes them. No code changes.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Replace all GFP_KERNEL and ls_allocation with GFP_NOFS.
ls_allocation would be GFP_KERNEL for userland lockspaces
and GFP_NOFS for file system lockspaces.
It was discovered that any lockspaces on the system can
affect all others by triggering memory reclaim in the
file system which could in turn call back into the dlm
to acquire locks, deadlocking dlm threads that were
shared by all lockspaces, like dlm_recv.
Signed-off-by: David Teigland <teigland@redhat.com>
The code to set up sctp sockets was not using the sockfd_lookup()
and sockfd_put() routines to translate an fd to a socket. The
direct fget and fput calls were resulting in error messages from
alloc_fd().
Also clean up two log messages and remove a third, related to
setting up sctp associations.
Signed-off-by: David Teigland <teigland@redhat.com>
The recently added dlm_lowcomms_connect_node() from
391fbdc5d5 does not work
when using SCTP instead of TCP. The sctp connection code
has nothing to do without data to send. Check for no data
in the sctp connection code and do nothing instead of
triggering a BUG. Also have connect_node() do nothing
when the protocol is sctp.
Signed-off-by: David Teigland <teigland@redhat.com>
Make all seq_operations structs const, to help mitigate against
revectoring user-triggerable function pointers.
This is derived from the grsecurity patch, although generated from scratch
because it's simpler than extracting the changes from there.
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
dlm: use kernel_sendpage
dlm: fix connection close handling
dlm: fix double-release of socket in error exit path
Using kernel_sendpage() is cleaner and safer than following
sock->ops ourselves.
Signed-off-by: Paolo Bonzini <bonzini@gnu.org>
Signed-off-by: David Teigland <teigland@redhat.com>
Closing a connection to a node can create problems if there are
outstanding messages for that node. The problems include dlm_send
spinning attempting to reconnect, or BUG from tcp_connect_to_sock()
attempting to use a partially closed connection.
To cleanly close a connection, we now first attempt to send any pending
messages, cancel any remaining workqueue work, and flag the connection
as closed to avoid reconnect attempts.
Signed-off-by: Lars Marowsky-Bree <lmb@suse.de>
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
The last correction to the tcp_connect_to_sock error exit path,
commit a89d63a159, can free an already
freed socket, due to collision with a previous (incomplete) attempt
to fix the same issue, commit 311f6fc77c.
Signed-off-by: Casey Dahlin <cdahlin@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
In the tcp_connect_to_sock() error exit path, the socket
allocated at the top of the function was not being freed.
Signed-off-by: Casey Dahlin <cdahlin@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This makes generic netlink network namespace aware. No
generic netlink families except for the controller family
are made namespace aware, they need to be checked one by
one and then set the family->netnsok member to true.
A new function genlmsg_multicast_netns() is introduced to
allow sending a multicast message in a given namespace,
for example when it applies to an object that lives in
that namespace, a new function genlmsg_multicast_allns()
to send a message to all network namespaces (for objects
that do not have an associated netns).
The function genlmsg_multicast() is changed to multicast
the message in just init_net, which is currently correct
for all generic netlink families since they only work in
init_net right now. Some will later want to work in all
net namespaces because they do not care about the netns
at all -- those will have to be converted to use one of
the new functions genlmsg_multicast_allns() or
genlmsg_multicast_netns() whenever they are made netns
aware in some way.
After this patch families can easily decide whether or
not they should be available in all net namespaces. Many
genl families us it for objects not related to networking
and should therefore be available in all namespaces, but
that will have to be done on a per family basis.
Note that this doesn't touch on the checkpoint/restart
problem where network namespaces could be used, genl
families and multicast groups are numbered globally and
I see no easy way of changing that, especially since it
must be possible to multicast to all network namespaces
for those families that do not care about netns.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a regression from the original addition of nfs lock support
586759f03e. When a synchronous
(non-nfs) plock completes, the waiting thread will wake up and
free the op struct. This races with the user thread in
dev_write() which goes on to read the op's callback field to
check if the lock is async and needs a callback. This check
can happen on the freed op. The fix is to note the callback
value before the op can be freed.
Signed-off-by: David Teigland <teigland@redhat.com>
CC [M] fs/dlm/lock.o
fs/dlm/lock.c: In function ‘find_rsb’:
fs/dlm/lock.c:438: warning: ‘r’ may be used uninitialized in this function
Since r is used on the error path to set r_ret, set it to NULL.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Change some GFP_KERNEL allocations to use either GFP_NOFS or
ls_allocation (when available) which the fs sets to GFP_NOFS.
The point is to prevent allocations from going back into the
cluster fs in places where that might lead to deadlock.
Signed-off-by: David Teigland <teigland@redhat.com>
Make network connections to other nodes earlier, in the context of
dlm_recoverd. This avoids connecting to nodes from dlm_send where we
try to avoid allocations which could possibly deadlock if memory reclaim
goes into the cluster fs which may try to do a dlm operation.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
When a lockspace was joined multiple times, the global dlm
use count was incremented when it should not have been. This
caused the global dlm threads to not be stopped when all
lockspaces were eventually be removed.
Signed-off-by: David Teigland <teigland@redhat.com>
Using offsetof() to calculate name length does not work because
it does not produce consistent results with with structure packing.
This caused memcpy to corrupt memory by copying 4 extra bytes off
the end of the buffer on 64 bit kernels with 32 bit userspace
(the only case where this 32/64 compat code is used).
The fix is to calculate name length directly from the start instead
of trying to derive it later using count and offsetof.
Signed-off-by: David Teigland <teigland@redhat.com>
Return immediately from dlm_unlock(CANCEL) if the lock is
granted and not being converted; there's nothing to cancel.
Signed-off-by: David Teigland <teigland@redhat.com>
When a conversion completes successfully and finds that a cancel
of the convert is still in progress (which is now a moot point),
preemptively clear the state associated with outstanding cancel.
That state could cause a subsequent conversion to be ignored.
Also, improve the consistency and content of error and debug
messages in this area.
Signed-off-by: David Teigland <teigland@redhat.com>
Integer nodeids can be too large for the idr code; use a hash
table instead.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
The ls_dirtbl[].lock was an rwlock, but since it was only used in write
mode a spinlock will suffice.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
dlm_posix_get fills out the relevant fields in the file_lock before
returning when there is a lock conflict, but doesn't clean out any of
the other fields in the file_lock.
When nfsd does a NFSv4 lockt call, it sets the fl_lmops to
nfsd_posix_mng_ops before calling the lower fs. When the lock comes back
after testing a lock on GFS2, it still has that field set. This confuses
nfsd into thinking that the file_lock is a nfsd4 lock.
Fix this by making DLM reinitialize the file_lock before copying the
fields from the conflicting lock.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
We should use the original copy of the file_lock, fl, instead
of the copy, flc in the lockd notify callback. The range in flc has
been modified by posix_lock_file(), so it will not match a copy of the
lock in lockd.
Signed-off-by: David Teigland <teigland@redhat.com>
The rwlock is almost always used in write mode, so there's no reason
to not use a spinlock instead.
Signed-off-by: David Teigland <teigland@redhat.com>
The old code would leak iterators and leave reference counts on
rsbs because it was ignoring the "stop" seq callback. The code
followed an example that used the seq operations differently.
This new code is based on actually understanding how the seq
operations work. It also improves things by saving the hash bucket
in the position to avoid cycling through completed buckets in start.
Siged-off-by: Davd Teigland <teigland@redhat.com>
fs/dlm/ast.c: In function 'dlm_astd':
fs/dlm/ast.c:64: warning: 'bastmode' may be used uninitialized in this function
Cleans code up.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Teigland <teigland@redhat.com>
The new debugfs entry dumps all rsb and lkb structures, and includes
a lot more information than has been available before. This includes
the new timestamps added by a previous patch for debugging callback
issues.
Signed-off-by: David Teigland <teigland@redhat.com>
Record the time the latest blocking callback was queued for
a lock. This will be used for debugging in combination with
lock queue timestamp changes in the previous patch.
Signed-off-by: David Teigland <teigland@redhat.com>
Use ktime instead of jiffies for timestamping lkb's. Also stamp the
time on every lkb whenever it's added to a resource queue, instead of
just stamping locks subject to timeouts. This will allow us to use
timestamps more widely for debugging all locks.
Signed-off-by: David Teigland <teigland@redhat.com>
The lkb bastmode value is set in the context of processing the
lock, and read by the dlm_astd thread. Because it's accessed
in these two separate contexts, the writing/reading ought to
be done under a lock. This is simple to do by setting it and
reading it when the lkb is added to and removed from dlm_astd's
callback list which is properly locked.
Signed-off-by: David Teigland <teigland@redhat.com>
Just before delivering a blocking callback (bast), the dlm_astd
thread checks again that the granted mode of the lkb actually
blocks the mode requested by the bast. The idea behind this was
originally that the granted mode may have changed since the bast
was queued, making the callback now unnecessary. Reasons for
removing this extra check are:
- dlm_astd doesn't lock the rsb before reading the lkb grmode, so
it's not technically safe (this removes the long standing FIXME)
- after running some tests, it doesn't appear the check ever actually
eliminates a bast
- delivering an unnecessary blocking callback isn't a bad thing and
can happen anyway
Signed-off-by: David Teigland <teigland@redhat.com>
This is a one-liner to use cond_resched() rather than schedule()
in the ast delivery loop. It should not be necessary to schedule
every time, so this will save some cpu time while continuing to
allow scheduling when required.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
The pages used in lowcomms are not highmem, so kmap is not necessary.
Cc: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Use ls_allocation for memory allocations, which a cluster fs sets to
GFP_NOFS. Use GFP_NOFS for allocations when no lockspace struct is
available. Taking dlm locks needs to avoid calling back into the
cluster fs because write-out can require taking dlm locks.
Cc: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
this warning:
fs/dlm/netlink.c: In function ‘dlm_timeout_warn’:
fs/dlm/netlink.c:131: warning: ‘send_skb’ may be used uninitialized in this function
triggers because GCC does not recognize the (correct) error flow
between prepare_data() and send_skb.
Annotate it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes a regression from commit 0f8e0d9a31,
"dlm: allow multiple lockspace creates".
An extraneous 'else' slipped into a code fragment being moved from
release_lockspace() to dlm_release_lockspace(). The result of the
unwanted 'else' is that dlm threads and structures are not stopped
and cleaned up when the final dlm lockspace is removed. Trying to
create a new lockspace again afterward will fail with
"kmem_cache_create: duplicate cache dlm_conn" because the cache
was not previously destroyed.
Signed-off-by: David Teigland <teigland@redhat.com>
sparc32:
fs/dlm/config.c:397: error: expected identifier or '(' before '{' token
fs/dlm/config.c: In function 'drop_node':
fs/dlm/config.c:589: warning: initialization from incompatible pointer type
fs/dlm/config.c:589: warning: initialization from incompatible pointer type
fs/dlm/config.c: In function 'release_node':
fs/dlm/config.c:601: warning: initialization from incompatible pointer type
fs/dlm/config.c:601: warning: initialization from incompatible pointer type
fs/dlm/config.c: In function 'show_node':
fs/dlm/config.c:717: warning: initialization from incompatible pointer type
fs/dlm/config.c:717: warning: initialization from incompatible pointer type
fs/dlm/config.c: In function 'store_node':
fs/dlm/config.c:726: warning: initialization from incompatible pointer type
fs/dlm/config.c:726: warning: initialization from incompatible pointer type
Cc: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Teigland <teigland@redhat.com>
Compare only the addr and port fields of sockaddr structures.
Fixes a problem with ipv6 where sin6_scope_id does not match.
Signed-off-by: David Teigland <teigland@redhat.com>
If dlm_controld (the userspace daemon that controls the setup and
recovery of the dlm) fails, the kernel should shut down the lockspaces
in the kernel rather than leaving them running. This is detected by
having dlm_controld hold a misc device open while running, and if
the kernel detects a close while the daemon is still needed, it stops
the lockspaces in the kernel.
Knowing that the userspace daemon isn't running also allows the
lockspace create/remove routines to avoid waiting on the daemon
for join/leave operations.
Signed-off-by: David Teigland <teigland@redhat.com>
Add a count for lockspace create and release so that create can
be called multiple times to use the lockspace from different places.
Also add the new flag DLM_LSFL_NEWEXCL to create a lockspace with
the previous behavior of returning -EEXIST if the lockspace already
exists.
Signed-off-by: David Teigland <teigland@redhat.com>
Add a dlm_ prefix to the struct names in config.c. This resolves a
conflict with struct node in particular, when include/linux/node.h
happens to be included.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Teigland <teigland@redhat.com>
A couple of unlikely error conditions were missing a kfree on the error
exit path.
Reported-by: Juha Leppanen <juha_motorsportcom@luukku.com>
Signed-off-by: David Teigland <teigland@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
dlm: fix uninitialized variable for search_rsb_list callers
dlm: release socket on error
dlm: fix basts for granted CW waiting PR/CW
dlm: check for null in device_write
Use a special error value FILE_LOCK_DEFERRED to mean that a locking
operation returned asynchronously. This is returned by
posix_lock_file() for sleeping locks to mean that the lock has been
queued on the block list, and will be woken up when it might become
available and needs to be retried (either fl_lmops->fl_notify() is
called or fl_wait is woken up).
f_op->lock() to mean either the above, or that the filesystem will
call back with fl_lmops->fl_grant() when the result of the locking
operation is known. The filesystem can do this for sleeping as well
as non-sleeping locks.
This is to make sure, that return values of -EAGAIN and -EINPROGRESS by
filesystems are not mistaken to mean an asynchronous locking.
This also makes error handling in fs/locks.c and lockd/svclock.c slightly
cleaner.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: David Teigland <teigland@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The configfs operations ->make_item() and ->make_group() currently
return a new item/group. A return of NULL signifies an error. Because
of this, -ENOMEM is the only return code bubbled up the stack.
Multiple folks have requested the ability to return specific error codes
when these operations fail. This patch adds that ability by changing the
->make_item/group() ops to return ERR_PTR() values. These errors are
bubbled up appropriately. NULL returns are changed to -ENOMEM for
compatibility.
Also updated are the in-kernel users of configfs.
This is a rework of reverted commit 11c3b79218.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
[PATCH] ocfs2: fix oops in mmap_truncate testing
configfs: call drop_link() to cleanup after create_link() failure
configfs: Allow ->make_item() and ->make_group() to return detailed errors.
configfs: Fix failing mkdir() making racing rmdir() fail
configfs: Fix deadlock with racing rmdir() and rename()
configfs: Make configfs_new_dirent() return error code instead of NULL
configfs: Protect configfs_dirent s_links list mutations
configfs: Introduce configfs_dirent_lock
ocfs2: Don't snprintf() without a format.
ocfs2: Fix CONFIG_OCFS2_DEBUG_FS #ifdefs
ocfs2/net: Silence build warnings on sparc64
ocfs2: Handle error during journal load
ocfs2: Silence an error message in ocfs2_file_aio_read()
ocfs2: use simple_read_from_buffer()
ocfs2: fix printk format warnings with OCFS2_FS_STATS=n
[PATCH 2/2] ocfs2: Instrument fs cluster locks
[PATCH 1/2] ocfs2: Add CONFIG_OCFS2_FS_STATS config option
The configfs operations ->make_item() and ->make_group() currently
return a new item/group. A return of NULL signifies an error. Because
of this, -ENOMEM is the only return code bubbled up the stack.
Multiple folks have requested the ability to return specific error codes
when these operations fail. This patch adds that ability by changing the
->make_item/group() ops to return an int.
Also updated are the in-kernel users of configfs.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
gcc 4.3.0 correctly emits the following warning.
search_rsb_list does not *r_ret if no dlm_rsb is found
and _search_rsb may pass the uninitialized value upstream
on the error path when both calls to search_rsb_list
return non-zero error.
The fix sets *r_ret to NULL on search_rsb_list's not-found path.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: David Teigland <teigland@redhat.com>
It seems that `sock' allocated by sock_create_kern in
tcp_connect_to_sock() of dlm/fs/lowcomms.c is not released if
dlm_nodeid_to_addr an error.
Acked-by: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
The fix in commit 3650925893 was addressing
the case of a granted PR lock with waiting PR and CW locks. It's a
special case that requires forcing a CW bast. However, that forced CW
bast was incorrectly applying to a second condition where the granted
lock was CW. So, the holder of a CW lock could receive an extraneous CW
bast instead of a PR bast. This fix narrows the original special case to
what was intended.
Signed-off-by: David Teigland <teigland@redhat.com>
If `device_write' method is called via "dlm-control",
file->private_data is NULL. (See ctl_device_open() in
user.c. ) Through proc->flags is read.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
The return value on writes to the plock device should be
the number of bytes written. It was returning 0 instead
when an nfs lock callback was involved.
Reported-by: Nathan Straz <nstraz@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Removed the section mismatch message:
WARNING: fs/dlm/dlm.o(.init.text+0x132): Section mismatch in reference from the function init_module() to the function .exit.text:dlm_netlink_exit()
Since dlm_netlink_exit() is called in the init_dlm() error handling,
the __exit annotation has been removed.
Signed-off-by: Leonardo Potenza <lpotenza@inwind.it>
Signed-off-by: David Teigland <teigland@redhat.com>
The semaphore connections_lock is used as a mutex. Convert it to the mutex
API.
Signed-off-by: Matthias Kaehlcke <matthias@kaehlcke.net>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Teigland <teigland@redhat.com>
__FUNCTION__ is gcc-specific, use __func__
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
dlm: linux/{dlm,dlm_device}.h: cleanup for userspace
dlm: common max length definitions
dlm: move plock code from gfs2
dlm: recover nodes that are removed and re-added
dlm: save master info after failed no-queue request
dlm: make dlm_print_rsb() static
dlm: match signedness between dlm_config_info and cluster_set
Add central definitions for max lockspace name length and max resource
name length. The lack of central definitions has resulted in scattered
private definitions which we can now clean up, including an unused one
in dlm_device.h.
Signed-off-by: David Teigland <teigland@redhat.com>
Move the code that handles cluster posix locks from gfs2 into the dlm
so that it can be used by both gfs2 and ocfs2.
Signed-off-by: David Teigland <teigland@redhat.com>
If a node is removed from a lockspace, and then added back before the
dlm is notified of the removal, the dlm will not detect the removal
and won't clear the old state from the node. This is fixed by using a
list of added nodes so the membership recovery can detect when a newly
added node is already in the member list.
Signed-off-by: David Teigland <teigland@redhat.com>
When a NOQUEUE request fails, the rsb res_master field is unnecessarily
reset to -1, instead of leaving the valid master setting in place. We
want to save the looked-up master values while the rsb is on the "toss
list" so that another lookup can be avoided if the rsb is soon reused.
The fix is to simply leave res_master value alone.
Signed-off-by: David Teigland <teigland@redhat.com>
cluster_set is only called from the macro CLUSTER_ATTR which defines read/write
access functions. Make the signedness match to avoid sparse warnings every time
CLUSTER_ATTR is used (lines 149-159) all of the form:
fs/dlm/config.c:149:1: warning: incorrect type in argument 3 (different signedness)
fs/dlm/config.c:149:1: expected unsigned int *info_field
fs/dlm/config.c:149:1: got int extern [toplevel] *<noident>
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
The recent patch to validate data lengths in rcom_names messages
failed to account for fake messages a node directs to itself before
ever sending it. In this case we need to fill in the message length
in the header for the validation code to use.
Signed-off-by: David Teigland <teigland@redhat.com>
it moves 365 bytes from .text to .init.text, and 30 bytes from .text to
.exit.text, saves memory.
Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
a) in device_write(): add sentinel NUL byte, making sure that lspace.name will
be NUL-terminated
b) in compat_input() be keep it simple about the amounts of data we are copying.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Teigland <teigland@redhat.com>
... those can happen and BUG() from DLM_ASSERT() in allocate_direntry() is
not a good way to handle them.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Teigland <teigland@redhat.com>
We *can* get there from receive_request() and dlm_recover_master_copy()
with namelen too large if incoming request is invalid; BUG() from
DLM_ASSERT() in allocate_rsb() is a bit excessive reaction to that
and in case of dlm_recover_master_copy() we would actually oops before
that while calculating hash of up to 64Kb worth of data - with data
actually being 64 _bytes_ in kmalloc()'ed struct.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Teigland <teigland@redhat.com>
* check that length is large enough to cover the non-variable part of message or
rcom resp. (after checking that it's large enough to cover the header, of
course).
* kill more pointless casts
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Teigland <teigland@redhat.com>
a) don't cast the pointer to dlm_header *, we use it as dlm_message *
anyway.
b) we copy the message into a queue element, then pass the pointer to
copy to dlm_receive_message_saved(); declare it properly to make sure
that we have the right alignment.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Teigland <teigland@redhat.com>
also change name_prefix from char pointer to char array.
Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
A couple small clean-ups. Remove unnecessary wrapper-functions in
rcom.c, and remove unnecessary casting and an unnecessary ASSERT in
util.c.
Signed-off-by: David Teigland <teigland@redhat.com>
The 32/64 compatibility code in the DLM does not check the validity of
the lock name length passed into it, so it can easily overwrite memory
if the value is rubbish (as early versions of libdlm can cause with
unlock calls, it doesn't zero the field).
This patch restricts the length of the name to the amount of data
actually passed into the call.
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
To prevent the master of an rsb from changing rapidly, an unused rsb is kept
on the "toss list" for a period of time to be reused. The toss list was
being cleared completely for each recovery, which is unnecessary. Much of
the benefit of the toss list can be maintained if nodes keep rsb's in their
toss list that they are the master of. These rsb's need to be included
when the resource directory is rebuilt during recovery.
Signed-off-by: David Teigland <teigland@redhat.com>
The invalid lockspace messages are normal and can appear relatively
often. They should be suppressed without debugging enabled.
Signed-off-by: David Teigland <teigland@redhat.com>
The dlm_put_lkb() can free the lkb and its associated ua structure,
so we can't depend on using the ua struct after the put.
Signed-off-by: David Teigland <teigland@redhat.com>
In a rare case we may need to repeat a local resource directory lookup
due to a race with removing the rsb and removing the resdir record.
We'll never need to do more than a single additional lookup, though,
so the infinite loop around the lookup can be removed. In addition
to being unnecessary, the infinite loop is dangerous since some other
unknown condition may appear causing the loop to never break.
Signed-off-by: David Teigland <teigland@redhat.com>