linux

Author	SHA1	Message	Date
Hiroaki SHIMODA	696ecdc106	net_sched: gact: Fix potential panic in tcf_gact(). gact_rand array is accessed by gact->tcfg_ptype whose value is assumed to less than MAX_RAND, but any range checks are not performed. So add a check in tcf_gact_init(). And in tcf_gact(), we can reduce a branch. Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-03 16:47:24 -07:00
John W. Linville	1a26904eb6	Merge branch 'for-john' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211	2012-08-02 13:49:38 -04:00
Sylvain Munaut	f0666b1ac8	libceph: fix crypto key null deref, memory leak Avoid crashing if the crypto key payload was NULL, as when it was not correctly allocated and initialized. Also, avoid leaking it. Signed-off-by: Sylvain Munaut <tnt@246tNt.com> Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-08-02 09:19:20 -07:00
Paul Stewart	899852af60	cfg80211: Clear "beacon_found" on regulatory restore Restore the default state to the "beacon_found" flag when the channel flags are restored. Otherwise, we can end up with a channel that we can no longer transmit on even when we can see beacons on that channel. Signed-off-by: Paul Stewart <pstew@chromium.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-08-02 15:34:22 +02:00
Seth Forshee	03f6b0843a	cfg80211: add channel flag to prohibit OFDM operation Currently the only way for wireless drivers to tell whether or not OFDM is allowed on the current channel is to check the regulatory information. However, this requires hodling cfg80211_mutex, which is not visible to the drivers. Other regulatory restrictions are provided as flags in the channel definition, so let's do similarly with OFDM. This patch adds a new flag, IEEE80211_CHAN_NO_OFDM, to tell drivers that OFDM on a channel is not allowed. This flag is set on any channels for which regulatory indicates that OFDM is prohibited. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Tested-by: Arend van Spriel <arend@broadcom.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-08-02 15:30:49 +02:00
Eric Dumazet	e33cdac014	ipv4: route.c cleanup Remove unused includes after IP cache removal Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-02 02:54:43 -07:00
Fan Du	e3c0d04750	Fix unexpected SA hard expiration after changing date After SA is setup, one timer is armed to detect soft/hard expiration, however the timer handler uses xtime to do the math. This makes hard expiration occurs first before soft expiration after setting new date with big interval. As a result new child SA is deleted before rekeying the new one. Signed-off-by: Fan Du <fdu@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-02 00:19:17 -07:00
Ben Hutchings	1485348d24	tcp: Apply device TSO segment limit earlier Cache the device gso_max_segs in sock::sk_gso_max_segs and use it to limit the size of TSO skbs. This avoids the need to fall back to software GSO for local TCP senders. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-02 00:19:17 -07:00
Ben Hutchings	30b678d844	net: Allow driver to limit number of GSO segments per skb A peer (or local user) may cause TCP to use a nominal MSS of as little as 88 (actual MSS of 76 with timestamps). Given that we have a sufficiently prodigious local sender and the peer ACKs quickly enough, it is nevertheless possible to grow the window for such a connection to the point that we will try to send just under 64K at once. This results in a single skb that expands to 861 segments. In some drivers with TSO support, such an skb will require hundreds of DMA descriptors; a substantial fraction of a TX ring or even more than a full ring. The TX queue selected for the skb may stall and trigger the TX watchdog repeatedly (since the problem skb will be retried after the TX reset). This particularly affects sfc, for which the issue is designated as CVE-2012-3412. Therefore: 1. Add the field net_device::gso_max_segs holding the device-specific limit. 2. In netif_skb_features(), if the number of segments is too high then mask out GSO features to force fall back to software GSO. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-02 00:19:17 -07:00
Johannes Berg	dd4c9260e7	mac80211: cancel mesh path timer The mesh path timer needs to be canceled when leaving the mesh as otherwise it could fire after the interface has been removed already. Cc: stable@vger.kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-08-01 21:03:21 +02:00
Johannes Berg	2d9957cce6	mac80211: clear timer bits when disconnecting There's a corner case that can happen when we suspend with a timer running, then resume and disconnect. If we connect again, suspend and resume we might start timers that shouldn't be running. Reset the timer flags to avoid this. This affects both mesh and managed modes. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-08-01 20:58:28 +02:00
Linus Torvalds	a0e881b7c1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second vfs pile from Al Viro: "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the deadlock reproduced by xfstests 068), symlink and hardlink restriction patches, plus assorted cleanups and fixes. Note that another fsfreeze deadlock (emergency thaw one) is not dealt with - the series by Fernando conflicts a lot with Jan's, breaks userland ABI (FIFREEZE semantics gets changed) and trades the deadlock for massive vfsmount leak; this is going to be handled next cycle. There probably will be another pull request, but that stuff won't be in it." Fix up trivial conflicts due to unrelated changes next to each other in drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c} * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits) delousing target_core_file a bit Documentation: Correct s_umount state for freeze_fs/unfreeze_fs fs: Remove old freezing mechanism ext2: Implement freezing btrfs: Convert to new freezing mechanism nilfs2: Convert to new freezing mechanism ntfs: Convert to new freezing mechanism fuse: Convert to new freezing mechanism gfs2: Convert to new freezing mechanism ocfs2: Convert to new freezing mechanism xfs: Convert to new freezing code ext4: Convert to new freezing mechanism fs: Protect write paths by sb_start_write - sb_end_write fs: Skip atime update on frozen filesystem fs: Add freezing handling to mnt_want_write() / mnt_drop_write() fs: Improve filesystem freezing handling switch the protection of percpu_counter list to spinlock nfsd: Push mnt_want_write() outside of i_mutex btrfs: Push mnt_want_write() outside of i_mutex fat: Push mnt_want_write() outside of i_mutex ...	2012-08-01 10:26:23 -07:00
Linus Torvalds	ac694dbdbc	Merge branch 'akpm' (Andrew's patch-bomb) Merge Andrew's second set of patches: - MM - a few random fixes - a couple of RTC leftovers * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits) rtc/rtc-88pm80x: remove unneed devm_kfree rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables tmpfs: distribute interleave better across nodes mm: remove redundant initialization mm: warn if pg_data_t isn't initialized with zero mips: zero out pg_data_t when it's allocated memcg: gix memory accounting scalability in shrink_page_list mm/sparse: remove index_init_lock mm/sparse: more checks on mem_section number mm/sparse: optimize sparse_index_alloc memcg: add mem_cgroup_from_css() helper memcg: further prevent OOM with too many dirty pages memcg: prevent OOM with too many dirty pages mm: mmu_notifier: fix freed page still mapped in secondary MMU mm: memcg: only check anon swapin page charges for swap cache mm: memcg: only check swap cache pages for repeated charging mm: memcg: split swapin charge function into private and public part mm: memcg: remove needless !mm fixup to init_mm when charging mm: memcg: remove unneeded shmem charge type ...	2012-07-31 19:25:39 -07:00
Linus Torvalds	3e9a97082f	This patch series contains a major revamp of how we collect entropy from interrupts for /dev/random and /dev/urandom. The goal is to addresses weaknesses discussed in the paper "Mining your Ps and Qs: Detection of Widespread Weak Keys in Network Devices", by Nadia Heninger, Zakir Durumeric, Eric Wustrow, J. Alex Halderman, which will be published in the Proceedings of the 21st Usenix Security Symposium, August 2012. (See https://factorable.net for more information and an extended version of the paper.) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABCAAGBQJQF/0DAAoJENNvdpvBGATwIowQAOep9QKtLrBvb2lwIRVmeiy8 lRf7V/tYZnz4FePbR0W92JQfKYkCV8yyOO0bmeRzWL3v4m+lRwDTSyA1DDyQMoH+ LOMzvDKSLJMSXTXdSOIr1WYACphViCR/9CrbMBCKSkYfZLJ1MdaEDxT3rcpTGD0T 6iknUweiSkHHhkerU5yQL7FKzD5kYUe0hsF47w7QVlHRHJsW2fsZqkFoh+RpnhNw 03u+djxNGBo9qV81vZ9D1b0vA9uRlEjoWOOEG2XE4M2iq6TUySueA72dQnCwunfi 3kG/u1Swv2dgq6aRrP3H7zdwhYSourGxziu3jNhEKwKEohrxYY7xjNX3RVeTqP67 AzlKsOTWpRLIDrzjSLlb8VxRQiZewu8Unex3e1G+eo20sbcIObHGrxNp7K00zZvd QZiMHhOwItwFTe4lBO+XbqH2JKbL9/uJmwh5EipMpQTraKO9E6N3CJiUHjzBLo2K iGDZxRMKf4gVJRwDxbbP6D70JPVu8ZJ09XVIpsXQ3Z1xNqaMF0QdCmP3ty56q1o0 NvkSXxPKrijZs8Sk0rVDqnJ3ll8PuDnXMv5eDtL42VT818I5WxESn9djjwEanGv0 TYxbFub/NRxmPEE5B2Js5FBpqsLf5f282OSMeS/5WLBbnHJR1OoPoAhGVpHvxntC bi5FC1OolqhvzVIdsqgt =u7KM -----END PGP SIGNATURE----- Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random Pull random subsystem patches from Ted Ts'o: "This patch series contains a major revamp of how we collect entropy from interrupts for /dev/random and /dev/urandom. The goal is to addresses weaknesses discussed in the paper "Mining your Ps and Qs: Detection of Widespread Weak Keys in Network Devices", by Nadia Heninger, Zakir Durumeric, Eric Wustrow, J. Alex Halderman, which will be published in the Proceedings of the 21st Usenix Security Symposium, August 2012. (See https://factorable.net for more information and an extended version of the paper.)" Fix up trivial conflicts due to nearby changes in drivers/{mfd/ab3100-core.c, usb/gadget/omap_udc.c} * tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: (33 commits) random: mix in architectural randomness in extract_buf() dmi: Feed DMI table to /dev/random driver random: Add comment to random_initialize() random: final removal of IRQF_SAMPLE_RANDOM um: remove IRQF_SAMPLE_RANDOM which is now a no-op sparc/ldc: remove IRQF_SAMPLE_RANDOM which is now a no-op [ARM] pxa: remove IRQF_SAMPLE_RANDOM which is now a no-op board-palmz71: remove IRQF_SAMPLE_RANDOM which is now a no-op isp1301_omap: remove IRQF_SAMPLE_RANDOM which is now a no-op pxa25x_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op omap_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op goku_udc: remove IRQF_SAMPLE_RANDOM which was commented out uartlite: remove IRQF_SAMPLE_RANDOM which is now a no-op drivers: hv: remove IRQF_SAMPLE_RANDOM which is now a no-op xen-blkfront: remove IRQF_SAMPLE_RANDOM which is now a no-op n2_crypto: remove IRQF_SAMPLE_RANDOM which is now a no-op pda_power: remove IRQF_SAMPLE_RANDOM which is now a no-op i2c-pmcmsp: remove IRQF_SAMPLE_RANDOM which is now a no-op input/serio/hp_sdc.c: remove IRQF_SAMPLE_RANDOM which is now a no-op mfd: remove IRQF_SAMPLE_RANDOM which is now a no-op ...	2012-07-31 19:07:42 -07:00
Linus Torvalds	6dbb35b0a7	NFS client updates for Linux 3.6 Features include: - Patches from Bryan to allow splitting of the NFSv2/v3/v4 code into separate modules. - Fix Oopses in the NFSv4 idmapper - Fix a deadlock whereby rpciod tries to allocate a new socket and ends up recursing into the NFS code due to memory reclaim. - Increase the number of permitted callback connections. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJQGF+vAAoJEGcL54qWCgDy6ncP/jKVl/Yjz6i1b/8QtC0EmYFb XNwbjZicNupVM98XPLm+sfdWxvoGiHSMbwG8t/hx5Z6CteM18adeGNO1KqP9xQyg scPvFmqj5VJNHsHztcHIFHFYdpsuJqRKcW3TA2l8AkNOm6uBcnXArRosYN6LrXik hI/jWcyv8wZuooSQrLf463JK37t/s6LuUQo6jKP4582sbANHvPdLkHFCYg3yt1Sk 2IIo2CLLs2cJUswGJnsHT76Jxfvh21NdGtvydjVjpr4H0/LY5GykBc23AAL9TWj6 KlegDO902hLzEE93xazF59hQZrPuwBi3quEMJ3X0mjdNQWb4G96s/t2hmwpTjWyc mjpRsuaGlCXDxcrI5dnU52hqKa0Ju1ipbLpghxSVfilRy998xvGp5Yc6ADU+pYnt uP09lamCyjFEa+gDSqGt7lv4dFmdPJjWZdkXLPv0ah5sXVMYAT8NRLUfiE1+hLLu zKaM84PHSEvykFeJ2WvjDTgF3L55y3x6L3LN4UkKmfO5nqQf7csPcquU1NfHh25S jjKGq2SKGoYG1l6FwK3hemup5cnFUEX78E2QeLrmaHg92fJDGrz587i6MPc5jrSe sOSX/CpuCJd4VhodIo8T00me0GZSHEU7RP2KEc518ZvrPmz13avXeLeG9nYhjuxM cV9Ex8zh2Y4aiUXCy3uA =KSix -----END PGP SIGNATURE----- Merge tag 'nfs-for-3.6-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull second wave of NFS client updates from Trond Myklebust: - Patches from Bryan to allow splitting of the NFSv2/v3/v4 code into separate modules. - Fix Oopses in the NFSv4 idmapper - Fix a deadlock whereby rpciod tries to allocate a new socket and ends up recursing into the NFS code due to memory reclaim. - Increase the number of permitted callback connections. * tag 'nfs-for-3.6-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: nfs: explicitly reject LOCK_MAND flock() requests nfs: increase number of permitted callback connections. SUNRPC: return negative value in case rpcbind client creation error NFS: Convert v4 into a module NFS: Convert v3 into a module NFS: Convert v2 into a module NFS: Keep module parameters in the generic NFS client NFS: Split out remaining NFS v4 inode functions NFS: Pass super operations and xattr handlers in the nfs_subversion NFS: Only initialize the ACL client in the v3 case NFS: Create a try_mount rpc op NFS: Remove the NFS v4 xdev mount function NFS: Add version registering framework NFS: Fix a number of bugs in the idmapper nfs: skip commit in releasepage if we're freeing memory for fs-related reasons sunrpc: clarify comments on rpc_make_runnable pnfsblock: bail out partial page IO	2012-07-31 18:45:44 -07:00
Linus Torvalds	fd37ce34bd	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking update from David S. Miller: "I think Eric Dumazet and I have dealt with all of the known routing cache removal fallout. Some other minor fixes all around. 1) Fix RCU of cached routes, particular of output routes which require liberation via call_rcu() instead of call_rcu_bh(). From Eric Dumazet. 2) Make sure we purge net device references in cached routes properly. 3) TG3 driver bug fixes from Michael Chan. 4) Fix reported 'expires' value in ipv6 routes, from Li Wei. 5) TUN driver ioctl leaks kernel bytes to userspace, from Mathias Krause." * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (22 commits) ipv4: Properly purge netdev references on uncached routes. ipv4: Cache routes in nexthop exception entries. ipv4: percpu nh_rth_output cache ipv4: Restore old dst_free() behavior. bridge: make port attributes const ipv4: remove rt_cache_rebuild_count net: ipv4: fix RCU races on dst refcounts net: TCP early demux cleanup tun: Fix formatting. net/tun: fix ioctl() based info leaks tg3: Update version to 3.124 tg3: Fix race condition in tg3_get_stats64() tg3: Add New 5719 Read DMA workaround tg3: Fix Read DMA workaround for 5719 A0. tg3: Request APE_LOCK_PHY before PHY access ipv6: fix incorrect route 'expires' value passed to userspace mISDN: Bugfix only few bytes are transfered on a connection seeq: use PTR_RET at init_module of driver bnx2x: remove cast around the kmalloc in bnx2x_prev_mark_path ipv4: clean up put_child ...	2012-07-31 18:43:13 -07:00
Mel Gorman	a564b8f039	nfs: enable swap on NFS Implement the new swapfile a_ops for NFS and hook up ->direct_IO. This will set the NFS socket to SOCK_MEMALLOC and run socket reconnect under PF_MEMALLOC as well as reset SOCK_MEMALLOC before engaging the protocol ->connect() method. PF_MEMALLOC should allow the allocation of struct socket and related objects and the early (re)setting of SOCK_MEMALLOC should allow us to receive the packets required for the TCP connection buildup. [jlayton@redhat.com: Restore PF_MEMALLOC task flags in all cases] [dfeng@redhat.com: Fix handling of multiple swap files] [a.p.zijlstra@chello.nl: Original patch] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Neil Brown <neilb@suse.de> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-07-31 18:42:48 -07:00
Mel Gorman	c76562b670	netvm: prevent a stream-specific deadlock This patch series is based on top of "Swap-over-NBD without deadlocking v15" as it depends on the same reservation of PF_MEMALLOC reserves logic. When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. In diskless systems this is not an option so if swap if required then swapping over the network is considered. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap but this is not always an option. There is no guarantee that the network attached storage (NAS) device is running Linux or supports NBD. However, it is likely that it supports NFS so there are users that want support for swapping over NFS despite any performance concern. Some distributions currently carry patches that support swapping over NFS but it would be preferable to support it in the mainline kernel. Patch 1 avoids a stream-specific deadlock that potentially affects TCP. Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC reserves. Patch 3 adds three helpers for filesystems to handle swap cache pages. For example, page_file_mapping() returns page->mapping for file-backed pages and the address_space of the underlying swap file for swap cache pages. Patch 4 adds two address_space_operations to allow a filesystem to pin all metadata relevant to a swapfile in memory. Upon successful activation, the swapfile is marked SWP_FILE and the address space operation ->direct_IO is used for writing and ->readpage for reading in swap pages. Patch 5 notes that patch 3 is bolting filesystem-specific-swapfile-support onto the side and that the default handlers have different information to what is available to the filesystem. This patch refactors the code so that there are generic handlers for each of the new address_space operations. Patch 6 adds an API to allow a vector of kernel addresses to be translated to struct pages and pinned for IO. Patch 7 adds support for using highmem pages for swap by kmapping the pages before calling the direct_IO handler. Patch 8 updates NFS to use the helpers from patch 3 where necessary. Patch 9 avoids setting PF_private on PG_swapcache pages within NFS. Patch 10 implements the new swapfile-related address_space operations for NFS and teaches the direct IO handler how to manage kernel addresses. Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO where appropriate. Patch 12 fixes a NULL pointer dereference that occurs when using swap-over-NFS. With the patches applied, it is possible to mount a swapfile that is on an NFS filesystem. Swap performance is not great with a swap stress test taking roughly twice as long to complete than if the swap device was backed by NBD. This patch: netvm: prevent a stream-specific deadlock It could happen that all !SOCK_MEMALLOC sockets have buffered so much data that we're over the global rmem limit. This will prevent SOCK_MEMALLOC buffers from receiving data, which will prevent userspace from running, which is needed to reduce the buffered data. Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit. Once this change it applied, it is important that sockets that set SOCK_MEMALLOC do not clear the flag until the socket is being torn down. If this happens, a warning is generated and the tokens reclaimed to avoid accounting errors until the bug is fixed. [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Rik van Riel <riel@redhat.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Neil Brown <neilb@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-07-31 18:42:47 -07:00
Mel Gorman	b4b9e35585	netvm: set PF_MEMALLOC as appropriate during SKB processing In order to make sure pfmemalloc packets receive all memory needed to proceed, ensure processing of pfmemalloc SKBs happens under PF_MEMALLOC. This is limited to a subset of protocols that are expected to be used for writing to swap. Taps are not allowed to use PF_MEMALLOC as these are expected to communicate with userspace processes which could be paged out. [a.p.zijlstra@chello.nl: Ideas taken from various patches] [jslaby@suse.cz: Lock imbalance fix] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: David S. Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-07-31 18:42:46 -07:00
Mel Gorman	c93bdd0e03	netvm: allow skb allocation to use PFMEMALLOC reserves Change the skb allocation API to indicate RX usage and use this to fall back to the PFMEMALLOC reserve when needed. SKBs allocated from the reserve are tagged in skb->pfmemalloc. If an SKB is allocated from the reserve and the socket is later found to be unrelated to page reclaim, the packet is dropped so that the memory remains available for page reclaim. Network protocols are expected to recover from this packet loss. [a.p.zijlstra@chello.nl: Ideas taken from various patches] [davem@davemloft.net: Use static branches, coding style corrections] [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: David S. Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-07-31 18:42:46 -07:00
Mel Gorman	7cb0240492	netvm: allow the use of __GFP_MEMALLOC by specific sockets Allow specific sockets to be tagged SOCK_MEMALLOC and use __GFP_MEMALLOC for their allocations. These sockets will be able to go below watermarks and allocate from the emergency reserve. Such sockets are to be used to service the VM (iow. to swap over). They must be handled kernel side, exposing such a socket to user-space is a bug. There is a risk that the reserves be depleted so for now, the administrator is responsible for increasing min_free_kbytes as necessary to prevent deadlock for their workloads. [a.p.zijlstra@chello.nl: Original patches] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: David S. Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-07-31 18:42:46 -07:00
Mel Gorman	99a1dec70d	net: introduce sk_gfp_atomic() to allow addition of GFP flags depending on the individual socket Introduce sk_gfp_atomic(), this function allows to inject sock specific flags to each sock related allocation. It is only used on allocation paths that may be required for writing pages back to network storage. [davem@davemloft.net: Use sk_gfp_atomic only when necessary] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: David S. Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-07-31 18:42:46 -07:00
Andrew Morton	c255a45805	memcg: rename config variables Sanity: CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM [mhocko@suse.cz: fix missed bits] Cc: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-07-31 18:42:43 -07:00
David S. Miller	caacf05e5a	ipv4: Properly purge netdev references on uncached routes. When a device is unregistered, we have to purge all of the references to it that may exist in the entire system. If a route is uncached, we currently have no way of accomplishing this. So create a global list that is scanned when a network device goes down. This mirrors the logic in net/core/dst.c's dst_ifdown(). Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-31 15:06:50 -07:00
David S. Miller	c5038a8327	ipv4: Cache routes in nexthop exception entries. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-31 15:02:02 -07:00
Linus Torvalds	08843b79fb	Merge branch 'nfsd-next' of git://linux-nfs.org/~bfields/linux Pull nfsd changes from J. Bruce Fields: "This has been an unusually quiet cycle--mostly bugfixes and cleanup. The one large piece is Stanislav's work to containerize the server's grace period--but that in itself is just one more step in a not-yet-complete project to allow fully containerized nfs service. There are a number of outstanding delegation, container, v4 state, and gss patches that aren't quite ready yet; 3.7 may be wilder." * 'nfsd-next' of git://linux-nfs.org/~bfields/linux: (35 commits) NFSd: make boot_time variable per network namespace NFSd: make grace end flag per network namespace Lockd: move grace period management from lockd() to per-net functions LockD: pass actual network namespace to grace period management functions LockD: manage grace list per network namespace SUNRPC: service request network namespace helper introduced NFSd: make nfsd4_manager allocated per network namespace context. LockD: make lockd manager allocated per network namespace LockD: manage grace period per network namespace Lockd: add more debug to host shutdown functions Lockd: host complaining function introduced LockD: manage used host count per networks namespace LockD: manage garbage collection timeout per networks namespace LockD: make garbage collector network namespace aware. LockD: mark host per network namespace on garbage collect nfsd4: fix missing fault_inject.h include locks: move lease-specific code out of locks_delete_lock locks: prevent side-effects of locks_release_private before file_lock is initialized NFSd: set nfsd_serv to NULL after service destruction NFSd: introduce nfsd_destroy() helper ...	2012-07-31 14:42:28 -07:00
Eric Dumazet	d26b3a7c4b	ipv4: percpu nh_rth_output cache Input path is mostly run under RCU and doesnt touch dst refcnt But output path on forwarding or UDP workloads hits badly dst refcount, and we have lot of false sharing, for example in ipv4_mtu() when reading rt->rt_pmtu Using a percpu cache for nh_rth_output gives a nice performance increase at a small cost. 24 udpflood test on my 24 cpu machine (dummy0 output device) (each process sends 1.000.000 udp frames, 24 processes are started) before : 5.24 s after : 2.06 s For reference, time on linux-3.5 : 6.60 s Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-31 14:41:39 -07:00
Eric Dumazet	54764bb647	ipv4: Restore old dst_free() behavior. commit `404e0a8b6a` (net: ipv4: fix RCU races on dst refcounts) tried to solve a race but added a problem at device/fib dismantle time : We really want to call dst_free() as soon as possible, even if sockets still have dst in their cache. dst_release() calls in free_fib_info_rcu() are not welcomed. Root of the problem was that now we also cache output routes (in nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in rt_free(), because output route lookups are done in process context. Based on feedback and initial patch from David Miller (adding another call_rcu_bh() call in fib, but it appears it was not the right fix) I left the inet_sk_rx_dst_set() helper and added __rcu attributes to nh_rth_output and nh_rth_input to better document what is going on in this code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-31 14:41:38 -07:00
Linus Torvalds	cc8362b1f6	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph changes from Sage Weil: "Lots of stuff this time around: - lots of cleanup and refactoring in the libceph messenger code, and many hard to hit races and bugs closed as a result. - lots of cleanup and refactoring in the rbd code from Alex Elder, mostly in preparation for the layering functionality that will be coming in 3.7. - some misc rbd cleanups from Josh Durgin that are finally going upstream - support for CRUSH tunables (used by newer clusters to improve the data placement) - some cleanup in our use of d_parent that Al brought up a while back - a random collection of fixes across the tree There is another patch coming that fixes up our ->atomic_open() behavior, but I'm going to hammer on it a bit more before sending it." Fix up conflicts due to commits that were already committed earlier in drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c} * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits) rbd: create rbd_refresh_helper() rbd: return obj version in __rbd_refresh_header() rbd: fixes in rbd_header_from_disk() rbd: always pass ops array to rbd_req_sync_op() rbd: pass null version pointer in add_snap() rbd: make rbd_create_rw_ops() return a pointer rbd: have __rbd_add_snap_dev() return a pointer libceph: recheck con state after allocating incoming message libceph: change ceph_con_in_msg_alloc convention to be less weird libceph: avoid dropping con mutex before fault libceph: verify state after retaking con lock after dispatch libceph: revoke mon_client messages on session restart libceph: fix handling of immediate socket connect failure ceph: update MAINTAINERS file libceph: be less chatty about stray replies libceph: clear all flags on con_close libceph: clean up con flags libceph: replace connection state bits with states libceph: drop unnecessary CLOSED check in socket state change callback libceph: close socket directly from ceph_con_close() ...	2012-07-31 14:35:28 -07:00
Linus Torvalds	1fad1e9a74	NFS client updates for Linux 3.6 Features include: - More preparatory patches for modularising NFSv2/v3/v4. Split out the various NFSv2/v3/v4-specific code into separate files - More preparation for the NFSv4 migration code - Ensure that OPEN(O_CREATE) observes the pNFS mds threshold parameters - pNFS fast failover when the data servers are down - Various cleanups and debugging patches -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJQFwYRAAoJEGcL54qWCgDy6YAQAIZuUPFCQbuzWev4L/XA0uhg 0xjr9YBmg57UOs8e7FhS0azJip+D/kmF2Rx5gCMX0QO/IwaEVoms4BS8zjFOr3yL cPIyY5ErF+WJ25f3Mz8k0wOHTzrOvMdq4/7TQf5hztsrGYFid78sSb3O9ZO9bgb4 QxqhgA0Z4S2vIqeObJAF9BT5UOT/cXfVKV0iTd52ZQVA+ZhqJ2SDlNFsoxnVOweV qzKOi0FZCPsjH+Z4a/LLdieHG5AYRPhOScBdG7gTFAdkysI8DDtSNCmJ68dMHYRw OHtyt/X4k9TImfwauV3Eww1sw3NkDswX+dv3GOSDTlMvVBrm0WXWj2zoHZbOB+0Q ETI9Zpolvd9pADtyfu9c+kCYIR+NdB4lGKd15iZfdEy/CZ0CtbLAbn5Szo2YcwNR iVjswR0jsbklD+mZjlMeZoG4JX2d9ZRqLs20POz7QQBmdIQ8jb9/SwVj9zGvX0w0 NyVUHTFlGaNwkpo7oeRoWi1xjZWE6OzfoncV/46DOJWb08OKZ4fR4ugMYJWGi7Xx I4nQrIiHtInqL+sF5Y3wlZ48dufLuIhAcHh7klxgud4GRj7t8j+a99pTrRmpHvvC 4K2Yu69VYcIA7N2RsSLohIllGPFmMwpdar7yERdD0So8/fz/zf7wn0dFFYY11+bL YNVVGhvNxccMU5EU9YR4 =Lc59 -----END PGP SIGNATURE----- Merge tag 'nfs-for-3.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client updates from Trond Myklebust: "Features include: - More preparatory patches for modularising NFSv2/v3/v4. Split out the various NFSv2/v3/v4-specific code into separate files - More preparation for the NFSv4 migration code - Ensure that OPEN(O_CREATE) observes the pNFS mds threshold parameters - pNFS fast failover when the data servers are down - Various cleanups and debugging patches" * tag 'nfs-for-3.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (67 commits) nfs: fix fl_type tests in NFSv4 code NFS: fix pnfs regression with directio writes NFS: fix pnfs regression with directio reads sunrpc: clnt: Add missing braces nfs: fix stub return type warnings NFS: exit_nfs_v4() shouldn't be an __exit function SUNRPC: Add a missing spin_unlock to gss_mech_list_pseudoflavors NFS: Split out NFS v4 client functions NFS: Split out the NFS v4 filesystem types NFS: Create a single nfs_clone_super() function NFS: Split out NFS v4 server creating code NFS: Initialize the NFS v4 client from init_nfs_v4() NFS: Move the v4 getroot code to nfs4getroot.c NFS: Split out NFS v4 file operations NFS: Initialize v4 sysctls from nfs_init_v4() NFS: Create an init_nfs_v4() function NFS: Split out NFS v4 inode operations NFS: Split out NFS v3 inode operations NFS: Split out NFS v2 inode operations NFS: Clean up nfs4_proc_setclientid() and friends ...	2012-07-30 19:16:57 -07:00
Sage Weil	6139919133	libceph: recheck con state after allocating incoming message We drop the lock when calling the ->alloc_msg() con op, which means we need to (a) not clobber con->in_msg without the mutex held, and (b) we need to verify that we are still in the OPEN state when we retake it to avoid causing any mayhem. If the state does change, -EAGAIN will get us back to con_work() and loop. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-07-30 18:19:45 -07:00
Sage Weil	4740a623d2	libceph: change ceph_con_in_msg_alloc convention to be less weird This function's calling convention is very limiting. In particular, we can't return any error other than ENOMEM (and only implicitly), which is a problem (see next patch). Instead, return an normal 0 or error code, and make the skip a pointer output parameter. Drop the useless in_hdr argument (we have the con pointer). Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-07-30 18:19:30 -07:00
Sage Weil	8636ea672f	libceph: avoid dropping con mutex before fault The ceph_fault() function takes the con mutex, so we should avoid dropping it before calling it. This fixes a potential race with another thread calling ceph_con_close(), or _open(), or similar (we don't reverify con->state after retaking the lock). Add annotation so that lockdep realizes we will drop the mutex before returning. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-07-30 18:17:13 -07:00
Sage Weil	7b862e07b1	libceph: verify state after retaking con lock after dispatch We drop the con mutex when delivering a message. When we retake the lock, we need to verify we are still in the OPEN state before preparing to read the next tag, or else we risk stepping on a connection that has been closed. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-07-30 18:16:56 -07:00
Sage Weil	4f471e4a9c	libceph: revoke mon_client messages on session restart Revoke all mon_client messages when we shut down the old connection. This is mostly moot since we are re-using the same ceph_connection, but it is cleaner. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-07-30 18:16:40 -07:00
Sage Weil	8007b8d626	libceph: fix handling of immediate socket connect failure If the connect() call immediately fails such that sock == NULL, we still need con_close_socket() to reset our socket state to CLOSED. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-07-30 18:16:16 -07:00
Sage Weil	756a16a5d5	libceph: be less chatty about stray replies There are many (normal) conditions that can lead to us getting unexpected replies, include cluster topology changes, osd failures, and timeouts. There's no need to spam the console about it. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-07-30 18:16:03 -07:00
Sage Weil	43c7427d10	libceph: clear all flags on con_close Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-30 18:16:02 -07:00
Sage Weil	4a86169208	libceph: clean up con flags Rename flags with CON_FLAG prefix, move the definitions into the c file, and (better) document their meaning. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-30 18:16:01 -07:00
Sage Weil	8dacc7da69	libceph: replace connection state bits with states Use a simple set of 6 enumerated values for the socket states (CON_STATE_*) and use those instead of the state bits. All of the con->state checks are now under the protection of the con mutex, so this is safe. It also simplifies many of the state checks because we can check for anything other than the expected state instead of various bits for races we can think of. This appears to hold up well to stress testing both with and without socket failure injection on the server side. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-30 18:16:00 -07:00
Sage Weil	d7353dd5aa	libceph: drop unnecessary CLOSED check in socket state change callback If we are CLOSED, the socket is closed and we won't get these. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-30 18:15:59 -07:00
Sage Weil	ee76e0736d	libceph: close socket directly from ceph_con_close() It is simpler to do this immediately, since we already hold the con mutex. It also avoids the need to deal with a not-quite-CLOSED socket in con_work. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-30 18:15:58 -07:00
Sage Weil	2e8cb10063	libceph: drop gratuitous socket close calls in con_work If the state is CLOSED or OPENING, we shouldn't have a socket. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-30 18:15:58 -07:00
Sage Weil	a59b55a602	libceph: move ceph_con_send() closed check under the con mutex Take the con mutex before checking whether the connection is closed to avoid racing with someone else closing it. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-30 18:15:57 -07:00
Sage Weil	00650931e5	libceph: move msgr clear_standby under con mutex protection Avoid dropping and retaking con->mutex in the ceph_con_send() case by leaving locking up to the caller. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-30 18:15:56 -07:00
Sage Weil	3b5ede07b5	libceph: fix fault locking; close socket on lossy fault If we fault on a lossy connection, we should still close the socket immediately, and do so under the con mutex. We should also take the con mutex before printing out the state bits in the debug output. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-30 18:15:55 -07:00
Jiaju Zhang	048a9d2d06	libceph: trivial fix for the incorrect debug output This is a trivial fix for the debug output, as it is inconsistent with the function name so may confuse people when debugging. [elder@inktank.com: switched to use __func__] Signed-off-by: Jiaju Zhang <jjzhang@suse.de> Reviewed-by: Alex Elder <elder@inktank.com>	2012-07-30 18:15:36 -07:00
Sage Weil	85effe183d	libceph: reset connection retry on successfully negotiation We exponentially back off when we encounter connection errors. If several errors accumulate, we will eventually wait ages before even trying to reconnect. Fix this by resetting the backoff counter after a successful negotiation/ connection with the remote node. Fixes ceph issue #2802. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-07-30 18:15:34 -07:00
Sage Weil	5469155f2b	libceph: protect ceph_con_open() with mutex Take the con mutex while we are initiating a ceph open. This is necessary because the may have previously been in use and then closed, which could result in a racing workqueue running con_work(). Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2012-07-30 18:15:33 -07:00
Sage Weil	a410702697	libceph: (re)initialize bio_iter on start of message receive Previously, we were opportunistically initializing the bio_iter if it appeared to be uninitialized in the middle of the read path. The problem is that a sequence like: - start reading message - initialize bio_iter - read half a message - messenger fault, reconnect - restart reading message - bio_iter now non-NULL, not reinitialized - read past end of bio, crash Instead, initialize the bio_iter unconditionally when we allocate/claim the message for read. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>	2012-07-30 18:15:31 -07:00
Sage Weil	6194ea895e	libceph: resubmit linger ops when pg mapping changes The linger op registration (i.e., watch) modifies the object state. As such, the OSD will reply with success if it has already applied without doing the associated side-effects (setting up the watch session state). If we lose the ACK and resubmit, we will see success but the watch will not be correctly registered and we won't get notifies. To fix this, always resubmit the linger op with a new tid. We accomplish this by re-registering as a linger (i.e., 'registered') if we are not yet registered. Then the second loop will treat this just like a normal case of re-registering. This mirrors a similar fix on the userland ceph.git, commit 5dd68b95, and ceph bug #2796. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>	2012-07-30 18:15:31 -07:00
Sage Weil	8c50c81756	libceph: fix mutex coverage for ceph_con_close Hold the mutex while twiddling all of the state bits to avoid possible races. While we're here, make not of why we cannot close the socket directly. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>	2012-07-30 18:15:30 -07:00
Sage Weil	3a140a0d5c	libceph: report socket read/write error message We need to set error_msg to something useful before calling ceph_fault(); do so here for try_{read,write}(). This is more informative than libceph: osd0 192.168.106.220:6801 (null) Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>	2012-07-30 18:15:29 -07:00
Sage Weil	546f04ef71	libceph: support crush tunables The server side recently added support for tuning some magic crush variables. Decode these variables if they are present, or use the default values if they are not present. Corresponds to ceph.git commit 89af369c25f274fe62ef730e5e8aad0c54f1e5a5. Signed-off-by: caleb miles <caleb.miles@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>	2012-07-30 18:15:23 -07:00
Stanislav Kinsbursky	caea33da89	SUNRPC: return negative value in case rpcbind client creation error Without this patch kernel will panic on LockD start, because lockd_up() checks lockd_up_net() result for negative value. From my pow it's better to return negative value from rpcbind routines instead of replacing all such checks like in lockd_up(). Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org [>= 3.0]	2012-07-30 20:39:05 -04:00
Sage Weil	1fe60e51a3	libceph: move feature bits to separate header This is simply cleanup that will keep things more closely synced with the userland code. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>	2012-07-30 16:23:22 -07:00
Jeff Layton	5cf02d09b5	nfs: skip commit in releasepage if we're freeing memory for fs-related reasons We've had some reports of a deadlock where rpciod ends up with a stack trace like this: PID: 2507 TASK: ffff88103691ab40 CPU: 14 COMMAND: "rpciod/14" #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9 #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs] #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8 #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs] #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs] #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670 #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271 #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638 #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f #10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e #11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f #12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad #13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942 #14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a #15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9 #16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b #17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808 #18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c #19 [ffff8810343bfce8] inet_create at ffffffff81483ba6 #20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7 #21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc] #22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc] #23 [ffff8810343bfe38] worker_thread at ffffffff810887d0 #24 [ffff8810343bfee8] kthread at ffffffff8108dd96 #25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca rpciod is trying to allocate memory for a new socket to talk to the server. The VM ends up calling ->releasepage to get more memory, and it tries to do a blocking commit. That commit can't succeed however without a connected socket, so we deadlock. Fix this by setting PF_FSTRANS on the workqueue task prior to doing the socket allocation, and having nfs_release_page check for that flag when deciding whether to do a commit call. Also, set PF_FSTRANS unconditionally in rpc_async_schedule since that function can also do allocations sometimes. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org	2012-07-30 18:55:59 -04:00
Jeff Layton	506026c3ec	sunrpc: clarify comments on rpc_make_runnable rpc_make_runnable is not generally called with the queue lock held, unless it's waking up a task that has been sitting on a waitqueue. This is safe when the task has not entered the FSM yet, but the comments don't really spell this out. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-07-30 18:55:08 -04:00
Joe Perches	cac5d07e3c	sunrpc: clnt: Add missing braces Add a missing set of braces that commit `4e0038b6b2` ("SUNRPC: Move clnt->cl_server into struct rpc_xprt") forgot. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org [>= 3.4]	2012-07-30 17:58:08 -04:00
stephen hemminger	5a0d513b62	bridge: make port attributes const Simple table that can be marked const. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-30 14:53:22 -07:00
Eric Dumazet	0c7462a235	ipv4: remove rt_cache_rebuild_count After IP route cache removal, rt_cache_rebuild_count is no longer used. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-30 14:53:22 -07:00
Eric Dumazet	404e0a8b6a	net: ipv4: fix RCU races on dst refcounts commit `c6cffba4ff` (ipv4: Fix input route performance regression.) added various fatal races with dst refcounts. crashes happen on tcp workloads if routes are added/deleted at the same time. The dst_free() calls from free_fib_info_rcu() are clearly racy. We need instead regular dst refcounting (dst_release()) and make sure dst_release() is aware of RCU grace periods : Add DST_RCU_FREE flag so that dst_release() respects an RCU grace period before dst destruction for cached dst Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero() to make sure we dont increase a zero refcount (On a dst currently waiting an rcu grace period before destruction) rt_cache_route() must take a reference on the new cached route, and release it if was not able to install it. With this patch, my machines survive various benchmarks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-30 14:53:22 -07:00
Eric Dumazet	cca32e4bf9	net: TCP early demux cleanup early_demux() handlers should be called in RCU context, and as we use skb_dst_set_noref(skb, dst), caller must not exit from RCU context before dst use (skb_dst(skb)) or release (skb_drop(dst)) Therefore, rcu_read_lock()/rcu_read_unlock() pairs around ->early_demux() are confusing and not needed : Protocol handlers are already in an RCU read lock section. (__netif_receive_skb() does the rcu_read_lock() ) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-30 14:53:21 -07:00
Guanjun He	a2a3258417	libceph: prevent the race of incoming work during teardown Add an atomic variable 'stopping' as flag in struct ceph_messenger, set this flag to 1 in function ceph_destroy_client(), and add the condition code in function ceph_data_ready() to test the flag value, if true(1), just return. Signed-off-by: Guanjun He <gjhe@suse.com> Reviewed-by: Sage Weil <sage@inktank.com>	2012-07-30 09:29:53 -07:00
Sage Weil	a16cb1f707	libceph: fix messenger retry In ancient times, the messenger could both initiate and accept connections. An artifact if that was data structures to store/process an incoming ceph_msg_connect request and send an outgoing ceph_msg_connect_reply. Sadly, the negotiation code was referencing those structures and ignoring important information (like the peer's connect_seq) from the correct ones. Among other things, this fixes tight reconnect loops where the server sends RETRY_SESSION and we (the client) retries with the same connect_seq as last time. This bug pretty easily triggered by injecting socket failures on the MDS and running some fs workload like workunits/direct_io/test_sync_io. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-30 09:29:52 -07:00
Sage Weil	cd43045c2d	libceph: initialize rb, list nodes in ceph_osd_request These don't strictly need to be initialized based on how they are used, but it is good practice to do so. Reported-by: Alex Elder <elder@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-30 09:29:51 -07:00
Sage Weil	d50b409fb8	libceph: initialize msgpool message types Initialize the type field for messages in a msgpool. The caller was doing this for osd ops, but not for the reply messages. Reported-by: Alex Elder <elder@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-30 09:29:50 -07:00
Eliad Peller	ba846a502c	mac80211: don't clear sched_scan_sdata on sched scan stop request ieee80211_request_sched_scan_stop() cleared local->sched_scan_sdata. However, sched_scan_sdata should be cleared only after the driver calls ieee80211_sched_scan_stopped() (like with normal hw scan). Clearing sched_scan_sdata too early caused ieee80211_sched_scan_stopped_work to exit prematurely without properly cleaning all the sched scan resources and without calling cfg80211_sched_scan_stopped (so userspace wasn't notified about sched scan completion). Signed-off-by: Eliad Peller <eliad@wizery.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-30 09:14:36 +02:00
Johannes Berg	fcb06702f0	Merge remote-tracking branch 'wireless/master' into mac80211	2012-07-30 09:13:03 +02:00
Li Wei	8253947e2c	ipv6: fix incorrect route 'expires' value passed to userspace When userspace use RTM_GETROUTE to dump route table, with an already expired route entry, we always got an 'expires' value(2147157) calculated base on INT_MAX. The reason of this problem is in the following satement: rt->dst.expires - jiffies < INT_MAX gcc promoted the type of both sides of '<' to unsigned long, thus a small negative value would be considered greater than INT_MAX. With the help of Eric Dumazet, do the out of bound checks in rtnl_put_cacheinfo(), _after_ conversion to clock_t. Signed-off-by: Li Wei <lw@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-29 23:18:31 -07:00
Lin Ming	61648d91fc	ipv4: clean up put_child The first parameter struct trie *t is not used anymore. Remove it. Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-29 23:18:30 -07:00
Lin Ming	4ea4bf7ebc	ipv4: fix debug info in tnode_new It should print size of struct rt_trie_node * allocated instead of size of struct rt_trie_node. Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-29 23:18:30 -07:00
Al Viro	faf0201029	clean unix_bind() up a bit Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-29 21:24:15 +04:00
Al Viro	a8104a9fcd	pull mnt_want_write()/mnt_drop_write() into kern_path_create()/done_path_create() resp. One side effect - attempt to create a cross-device link on a read-only fs fails with EROFS instead of EXDEV now. Makes more sense, POSIX allows, etc. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-29 21:24:15 +04:00
Al Viro	921a1650de	new helper: done_path_create() releases what needs to be released after {kern,user}_path_create() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-29 21:24:13 +04:00
Jiri Kosina	59ea33a68a	tcp: perform DMA to userspace only if there is a task waiting for it Back in 2006, commit `1a2449a87b` ("[I/OAT]: TCP recv offload to I/OAT") added support for receive offloading to IOAT dma engine if available. The code in tcp_rcv_established() tries to perform early DMA copy if applicable. It however does so without checking whether the userspace task is actually expecting the data in the buffer. This is not a problem under normal circumstances, but there is a corner case where this doesn't work -- and that's when MSG_TRUNC flag to recvmsg() is used. If the IOAT dma engine is not used, the code properly checks whether there is a valid ucopy.task and the socket is owned by userspace, but misses the check in the dmaengine case. This problem can be observed in real trivially -- for example 'tbench' is a good reproducer, as it makes a heavy use of MSG_TRUNC. On systems utilizing IOAT, you will soon find tbench waiting indefinitely in sk_wait_data(), as they have been already early-copied in tcp_rcv_established() using dma engine. This patch introduces the same check we are performing in the simple iovec copy case to the IOAT case as well. It fixes the indefinite recvmsg(MSG_TRUNC) hangs. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-27 13:45:51 -07:00
Jesse Gross	6081030769	Revert "openvswitch: potential NULL deref in sample()" This reverts commit `5b3e7e6cb5`. The problem that the original commit was attempting to fix can never happen in practice because validation is done one a per-flow basis rather than a per-packet basis. Adding additional checks at runtime is unnecessary and inconsistent with the rest of the code. CC: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-27 13:45:51 -07:00
Eric Dumazet	505fbcf035	ipv4: fix TCP early demux commit `92101b3b2e` (ipv4: Prepare for change of rt->rt_iif encoding.) invalidated TCP early demux, because rx_dst_ifindex is not properly initialized and checked. Also remove the use of inet_iif(skb) in favor or skb->skb_iif Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-27 13:45:51 -07:00
Jiri Benc	b1beb681cb	net: fix rtnetlink IFF_PROMISC and IFF_ALLMULTI handling When device flags are set using rtnetlink, IFF_PROMISC and IFF_ALLMULTI flags are handled specially. Function dev_change_flags sets IFF_PROMISC and IFF_ALLMULTI bits in dev->gflags according to the passed value but do_setlink passes a result of rtnl_dev_combine_flags which takes those bits from dev->flags. This can be easily trigerred by doing: tcpdump -i eth0 & ip l s up eth0 ip sets IFF_UP flag in ifi_flags and ifi_change, which is combined with IFF_PROMISC by rtnl_dev_combine_flags, causing __dev_change_flags to set IFF_PROMISC in gflags. Reported-by: Max Matveev <makc@redhat.com> Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-27 13:45:50 -07:00
Hangbin Liu	4249357010	tcp: Add TCP_USER_TIMEOUT negative value check TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int. But patch "tcp: Add TCP_USER_TIMEOUT socket option"(`dca43c75`) didn't check the negative values. If a user assign -1 to it, the socket will set successfully and wait for 4294967295 miliseconds. This patch add a negative value check to avoid this issue. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-27 13:45:50 -07:00
Linus Torvalds	aa0b3b2bee	Merge branch 'for-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds Pull LED subsystem update from Bryan Wu. * 'for-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds: (50 commits) leds-lp8788: forgotten unlock at lp8788_led_work LEDS: propagate error codes in blinkm_detect() LEDS: memory leak in blinkm_led_common_set() leds: add new lp8788 led driver LEDS: add BlinkM RGB LED driver, documentation and update MAINTAINERS leds: max8997: Simplify max8997_led_set_mode implementation leds/leds-s3c24xx: use devm_gpio_request leds: convert Network Space v2 LED driver to devm_kzalloc() and cleanup error exit path leds: convert DAC124S085 LED driver to devm_kzalloc() leds: convert LM3530 LED driver to devm_kzalloc() and cleanup error exit path leds: convert TCA6507 LED driver to devm_kzalloc() leds: convert Freescale MC13783 LED driver to devm_kzalloc() and cleanup error exit path leds: convert ADP5520 LED driver to devm_kzalloc() and cleanup error exit path leds: convert PCA955x LED driver to devm_kzalloc() and cleanup error exit path leds: convert Sun Fire LED driver to devm_kzalloc() and cleanup error exit path leds: convert PCA9532 LED driver to devm_kzalloc() leds: convert LT3593 LED driver to devm_kzalloc() leds: convert Renesas TPU LED driver to devm_kzalloc() and cleanup error exit path leds: convert LP5523 LED driver to devm_kzalloc() and cleanup error exit path leds: convert PCA9633 LED driver to devm_kzalloc() ...	2012-07-26 20:26:27 -07:00
Linus Torvalds	1e30c1b386	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking updates and fixes from David Miller: 1) Reinstate the no-ref optimization for input route lookups in ipv4 to fix some routing cache removal perf regressions. 2) Make TCP socket pre-demux work on ipv6 side too, from Eric Dumazet. 3) Get RX hash value from correct place in be2net driver, from Sarveshwar Bandi. 4) Validation of FIB cached routes missing critical check, from Eric Dumazet. 5) EEH support in mlx4 driver, from Kleber Sacilotto de Souza. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (23 commits) ipv6: Early TCP socket demux ipv4: Fix input route performance regression. pch_gbe: vlan skb len fix pch_gbe: add extra clean tx pch_gbe: fix transmit watchdog timeout ixgbe: fix panic while dumping packets on Tx hang with IOMMU be2net: Fix to parse RSS hash from Receive completions correctly. net/mlx4_en: Limit the RFS filter IDs to be < RPS_NO_FILTER hyperv: Add error handling to rndis_filter_device_add() hyperv: Add a check for ring_size value ipv4: rt_cache_valid must check expired routes net/pch_gpe: Cannot disable ethernet autonegation qeth: repair crash in qeth_l3_vlan_rx_kill_vid() netiucv: cleanup attribute usage net: wiznet add missing HAS_IOMEM dependency be2net: Missing byteswap in be_get_fw_log_level causes oops on PowerPC mlx4: Add support for EEH error recovery cdc-ncm: tag Ericsson WWAN devices (eg F5521gw) with FLAG_WWAN wanmain: comparing array with NULL caif: fix NULL pointer check ...	2012-07-26 18:09:01 -07:00
Eric Dumazet	c7109986db	ipv6: Early TCP socket demux This is the IPv6 missing bits for infrastructure added in commit `41063e9dd1` (ipv4: Early TCP socket demux.) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-26 15:50:39 -07:00
David S. Miller	c6cffba4ff	ipv4: Fix input route performance regression. With the routing cache removal we lost the "noref" code paths on input, and this can kill some routing workloads. Reinstate the noref path when we hit a cached route in the FIB nexthops. With help from Eric Dumazet. Reported-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-26 15:50:39 -07:00
Eric Dumazet	4331debc51	ipv4: rt_cache_valid must check expired routes commit `d2d68ba9fe` (ipv4: Cache input routes in fib_info nexthops.) introduced rt_cache_valid() helper. It unfortunately doesn't check if route is expired before caching it. I noticed sk_setup_caps() was constantly called on a tcp workload. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-25 15:24:14 -07:00
Stanislaw Gruszka	5e31fc0815	wireless: reg: restore previous behaviour of chan->max_power calculations commit `eccc068e8e` Author: Hong Wu <Hong.Wu@dspg.com> Date: Wed Jan 11 20:33:39 2012 +0200 wireless: Save original maximum regulatory transmission power for the calucation of the local maximum transmit pow changed the way we calculate chan->max_power as min(chan->max_power, chan->max_reg_power). That broke rt2x00 (and perhaps some other drivers) that do not set chan->max_power. It is not so easy to fix this problem correctly in rt2x00. According to commit `eccc068e8` changelog, change claim only to save maximum regulatory power - changing setting of chan->max_power was side effect. This patch restore previous calculations of chan->max_power and do not touch chan->max_reg_power. Cc: stable@vger.kernel.org # 3.4+ Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Luis R. Rodriguez <mcgrof@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-25 16:11:12 +02:00
Alan Cox	8b72ff6484	wanmain: comparing array with NULL gcc really should warn about these ! Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-24 13:55:21 -07:00
Eric Dumazet	9cb429d692	tcp: early_demux fixes 1) Remove a non needed pskb_may_pull() in tcp_v4_early_demux() and fix a potential bug if skb->head was reallocated (iph & th pointers were not reloaded) TCP stack will pull/check headers anyway. 2) must reload iph in ip_rcv_finish() after early_demux() call since skb->head might have changed. 3) skb->dev->ifindex can be now replaced by skb->skb_iif Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-24 13:54:15 -07:00
Linus Torvalds	d14b7a419a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree from Jiri Kosina: "Trivial updates all over the place as usual." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits) Fix typo in include/linux/clk.h . pci: hotplug: Fix typo in pci iommu: Fix typo in iommu video: Fix typo in drivers/video Documentation: Add newline at end-of-file to files lacking one arm,unicore32: Remove obsolete "select MISC_DEVICES" module.c: spelling s/postition/position/g cpufreq: Fix typo in cpufreq driver trivial: typo in comment in mksysmap mach-omap2: Fix typo in debug message and comment scsi: aha152x: Fix sparse warning and make printing pointer address more portable. Change email address for Steve Glendinning Btrfs: fix typo in convert_extent_bit via: Remove bogus if check netprio_cgroup.c: fix comment typo backlight: fix memory leak on obscure error path Documentation: asus-laptop.txt references an obsolete Kconfig item Documentation: ManagementStyle: fixed typo mm/vmscan: cleanup comment error in balance_pgdat mm: cleanup on the comments of zone_reclaim_stat ...	2012-07-24 13:34:56 -07:00
Linus Torvalds	3c4cfadef6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking changes from David S Miller: 1) Remove the ipv4 routing cache. Now lookups go directly into the FIB trie and use prebuilt routes cached there. No more garbage collection, no more rDOS attacks on the routing cache. Instead we now get predictable and consistent performance, no matter what the pattern of traffic we service. This has been almost 2 years in the making. Special thanks to Julian Anastasov, Eric Dumazet, Steffen Klassert, and others who have helped along the way. I'm sure that with a change of this magnitude there will be some kind of fallout, but such things ought the be simple to fix at this point. Luckily I'm not European so I'll be around all of August to fix things :-) The major stages of this work here are each fronted by a forced merge commit whose commit message contains a top-level description of the motivations and implementation issues. 2) Pre-demux of established ipv4 TCP sockets, saves a route demux on input. 3) TCP SYN/ACK performance tweaks from Eric Dumazet. 4) Add namespace support for netfilter L4 conntrack helpers, from Gao Feng. 5) Add config mechanism for Energy Efficient Ethernet to ethtool, from Yuval Mintz. 6) Remove quadratic behavior from /proc/net/unix, from Eric Dumazet. 7) Support for connection tracker helpers in userspace, from Pablo Neira Ayuso. 8) Allow userspace driven TX load balancing functions in TEAM driver, from Jiri Pirko. 9) Kill off NLMSG_PUT and RTA_PUT macros, more gross stuff with embedded gotos. 10) TCP Small Queues, essentially minimize the amount of TCP data queued up in the packet scheduler layer. Whereas the existing BQL (Byte Queue Limits) limits the pkt_sched --> netdevice queuing levels, this controls the TCP --> pkt_sched queueing levels. From Eric Dumazet. 11) Reduce the number of get_page/put_page ops done on SKB fragments, from Alexander Duyck. 12) Implement protection against blind resets in TCP (RFC 5961), from Eric Dumazet. 13) Support the client side of TCP Fast Open, basically the ability to send data in the SYN exchange, from Yuchung Cheng. Basically, the sender queues up data with a sendmsg() call using MSG_FASTOPEN, then they do the connect() which emits the queued up fastopen data. 14) Avoid all the problems we get into in TCP when timers or PMTU events hit a locked socket. The TCP Small Queues changes added a tcp_release_cb() that allows us to queue work up to the release_sock() caller, and that's what we use here too. From Eric Dumazet. 15) Zero copy on TX support for TUN driver, from Michael S. Tsirkin. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1870 commits) genetlink: define lockdep_genl_is_held() when CONFIG_LOCKDEP r8169: revert "add byte queue limit support". ipv4: Change rt->rt_iif encoding. net: Make skb->skb_iif always track skb->dev ipv4: Prepare for change of rt->rt_iif encoding. ipv4: Remove all RTCF_DIRECTSRC handliing. ipv4: Really ignore ICMP address requests/replies. decnet: Don't set RTCF_DIRECTSRC. net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse. ipv4: Remove redundant assignment rds: set correct msg_namelen openvswitch: potential NULL deref in sample() tcp: dont drop MTU reduction indications bnx2x: Add new 57840 device IDs tcp: avoid oops in tcp_metrics and reset tcpm_stamp niu: Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return value niu: Fix to check for dma mapping errors. net: Fix references to out-of-scope variables in put_cmsg_compat() net: ethernet: davinci_emac: add pm_runtime support net: ethernet: davinci_emac: Remove unnecessary #include ...	2012-07-24 10:01:50 -07:00
Johannes Berg	3aa569c3fe	mac80211: fix scan_sdata assignment We need to use RCU to assign scan_sdata. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-24 16:54:11 +02:00
WANG Cong	320f5ea0ce	genetlink: define lockdep_genl_is_held() when CONFIG_LOCKDEP lockdep_is_held() is defined when CONFIG_LOCKDEP, not CONFIG_PROVE_LOCKING. Cc: "David S. Miller" <davem@davemloft.net> Cc: Jesse Gross <jesse@nicira.com> Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-24 00:01:30 -07:00
Shuah Khan	19cd67e2d5	leds: Rename led_brightness_set() to led_set_brightness() Rename leds external interface led_brightness_set() to led_set_brightness(). This is the second phase of the change to reduce confusion between the leds internal and external interfaces that set brightness. With this change, now the external interface is led_set_brightness(). The first phase renamed the internal interface led_set_brightness() to __led_set_brightness(). There are no changes to the interface implementations. Signed-off-by: Shuah Khan <shuahkhan@gmail.com> Signed-off-by: Bryan Wu <bryan.wu@canonical.com>	2012-07-24 07:52:34 +08:00
David S. Miller	13378cad02	ipv4: Change rt->rt_iif encoding. On input packet processing, rt->rt_iif will be zero if we should use skb->dev->ifindex. Since we access rt->rt_iif consistently via inet_iif(), that is the only spot whose interpretation have to adjust. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 16:36:27 -07:00
David S. Miller	b68581778c	net: Make skb->skb_iif always track skb->dev Make it follow device decapsulation, from things such as VLAN and bonding. The stuff that actually cares about pre-demuxed device pointers, is handled by the "orig_dev" variable in __netif_receive_skb(). And the only consumer of that is the po->origdev feature of AF_PACKET sockets. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 16:36:27 -07:00
David S. Miller	92101b3b2e	ipv4: Prepare for change of rt->rt_iif encoding. Use inet_iif() consistently, and for TCP record the input interface of cached RX dst in inet sock. rt->rt_iif is going to be encoded differently, so that we can legitimately cache input routes in the FIB info more aggressively. When the input interface is "use SKB device index" the rt->rt_iif will be set to zero. This forces us to move the TCP RX dst cache installation into the ipv4 specific code, and as well it should since doing the route caching for ipv6 is pointless at the moment since it is not inspected in the ipv6 input paths yet. Also, remove the unlikely on dst->obsolete, all ipv4 dsts have obsolete set to a non-zero value to force invocation of the check callback. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 16:36:26 -07:00
David S. Miller	fe3edf4579	ipv4: Remove all RTCF_DIRECTSRC handliing. The last and final kernel user, ICMP address replies, has been removed. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 13:22:20 -07:00
David S. Miller	838942a594	ipv4: Really ignore ICMP address requests/replies. Alexey removed kernel side support for requests, and the only thing we do for replies is log a message if something doesn't look right. As Alexey's comment indicates, this belongs in userspace (if anywhere), and thus we can safely just get rid of this code. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 13:20:26 -07:00
David S. Miller	8acfaa9484	decnet: Don't set RTCF_DIRECTSRC. It's an ipv4 defined route flag, and only ipv4 uses it. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 13:16:59 -07:00
Saurabh	e7d4b18cbe	net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse. With CONFIG_SPARSE_RCU_POINTER=y sparse identified references which did not specificy __rcu in ip_vti.c Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 13:00:54 -07:00
Lin Ming	8fe5cb873b	ipv4: Remove redundant assignment It is redundant to set no_addr and accept_local to 0 and then set them with other values just after that. Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 13:00:54 -07:00
Linus Torvalds	a66d2c8f7e	Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull the big VFS changes from Al Viro: "This one is big and changes quite a few things around VFS. What's in there: - the first of two really major architecture changes - death to open intents. The former is finally there; it was very long in making, but with Miklos getting through really hard and messy final push in fs/namei.c, we finally have it. Unlike his variant, this one doesn't introduce struct opendata; what we have instead is ->atomic_open() taking preallocated struct file * and passing everything via its fields. Instead of returning struct file , it returns -E... on error, 0 on success and 1 in "deal with it yourself" case (e.g. symlink found on server, etc.). See comments before fs/namei.c:atomic_open(). That made a lot of goodies finally possible and quite a few are in that pile: ->lookup(), ->d_revalidate() and ->create() do not get struct nameidata anymore; ->lookup() and ->d_revalidate() get lookup flags instead, ->create() gets "do we want it exclusive" flag. With the introduction of new helper (kern_path_locked()) we are rid of all struct nameidata instances outside of fs/namei.c; it's still visible in namei.h, but not for long. Come the next cycle, declaration will move either to fs/internal.h or to fs/namei.c itself. [me, miklos, hch] - The second major change: behaviour of final fput(). Now we have __fput() done without any locks held by caller and not from deep in call stack. That obviously lifts a lot of constraints on the locking in there. Moreover, it's legal now to call fput() from atomic contexts (which has immediately simplified life for aio.c). We also don't need anti-recursion logics in __scm_destroy() anymore. There is a price, though - the damn thing has become partially asynchronous. For fput() from normal process we are guaranteed that pending __fput() will be done before the caller returns to userland, exits or gets stopped for ptrace. For kernel threads and atomic contexts it's done via schedule_work(), so theoretically we might need a way to make sure it's finished; so far only one such place had been found, but there might be more. There's flush_delayed_fput() (do all pending __fput()) and there's __fput_sync() (fput() analog doing __fput() immediately). I hope we won't need them often; see warnings in fs/file_table.c for details. [me, based on task_work series from Oleg merged last cycle] - sync series from Jan - large part of "death to sync_supers()" work from Artem; the only bits missing here are exofs and ext4 ones. As far as I understand, those are going via the exofs and ext4 trees resp.; once they are in, we can put ->write_super() to the rest, along with the thread calling it. - preparatory bits from unionmount series (from dhowells). - assorted cleanups and fixes all over the place, as usual. This is not the last pile for this cycle; there's at least jlayton's ESTALE work and fsfreeze series (the latter - in dire need of fixes, so I'm not sure it'll make the cut this cycle). I'll probably throw symlink/hardlink restrictions stuff from Kees into the next pile, too. Plus there's a lot of misc patches I hadn't thrown into that one - it's large enough as it is..." * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits) ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file() btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file() switch dentry_open() to struct path, make it grab references itself spufs: shift dget/mntget towards dentry_open() zoran: don't bother with struct file * in zoran_map ecryptfs: don't reinvent the wheels, please - use struct completion don't expose I_NEW inodes via dentry->d_inode tidy up namei.c a bit unobfuscate follow_up() a bit ext3: pass custom EOF to generic_file_llseek_size() ext4: use core vfs llseek code for dir seeks vfs: allow custom EOF in generic_file_llseek code vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes vfs: Remove unnecessary flushing of block devices vfs: Make sys_sync writeout also block device inodes vfs: Create function for iterating over block devices vfs: Reorder operations during sys_sync quota: Move quota syncing to ->sync_fs method quota: Split dquot_quota_sync() to writeback and cache flushing part vfs: Move noop_backing_dev_info check from sync into writeback ...	2012-07-23 12:27:27 -07:00
Weiping Pan	06b6a1cf6e	rds: set correct msg_namelen Jay Fenlason (fenlason@redhat.com) found a bug, that recvfrom() on an RDS socket can return the contents of random kernel memory to userspace if it was called with a address length larger than sizeof(struct sockaddr_in). rds_recvmsg() also fails to set the addr_len paramater properly before returning, but that's just a bug. There are also a number of cases wher recvfrom() can return an entirely bogus address. Anything in rds_recvmsg() that returns a non-negative value but does not go through the "sin = (struct sockaddr_in )msg->msg_name;" code path at the end of the while(1) loop will return up to 128 bytes of kernel memory to userspace. And I write two test programs to reproduce this bug, you will see that in rds_server, fromAddr will be overwritten and the following sock_fd will be destroyed. Yes, it is the programmer's fault to set msg_namelen incorrectly, but it is better to make the kernel copy the real length of address to user space in such case. How to run the test programs ? I test them on 32bit x86 system, 3.5.0-rc7. 1 compile gcc -o rds_client rds_client.c gcc -o rds_server rds_server.c 2 run ./rds_server on one console 3 run ./rds_client on another console 4 you will see something like: server is waiting to receive data... old socket fd=3 server received data from client:data from client msg.msg_namelen=32 new socket fd=-1067277685 sendmsg() : Bad file descriptor /************** rds_client.c *****************/ int main(void) { int sock_fd; struct sockaddr_in serverAddr; struct sockaddr_in toAddr; char recvBuffer[128] = "data from client"; struct msghdr msg; struct iovec iov; sock_fd = socket(AF_RDS, SOCK_SEQPACKET, 0); if (sock_fd < 0) { perror("create socket error\n"); exit(1); } memset(&serverAddr, 0, sizeof(serverAddr)); serverAddr.sin_family = AF_INET; serverAddr.sin_addr.s_addr = inet_addr("127.0.0.1"); serverAddr.sin_port = htons(4001); if (bind(sock_fd, (struct sockaddr)&serverAddr, sizeof(serverAddr)) < 0) { perror("bind() error\n"); close(sock_fd); exit(1); } memset(&toAddr, 0, sizeof(toAddr)); toAddr.sin_family = AF_INET; toAddr.sin_addr.s_addr = inet_addr("127.0.0.1"); toAddr.sin_port = htons(4000); msg.msg_name = &toAddr; msg.msg_namelen = sizeof(toAddr); msg.msg_iov = &iov; msg.msg_iovlen = 1; msg.msg_iov->iov_base = recvBuffer; msg.msg_iov->iov_len = strlen(recvBuffer) + 1; msg.msg_control = 0; msg.msg_controllen = 0; msg.msg_flags = 0; if (sendmsg(sock_fd, &msg, 0) == -1) { perror("sendto() error\n"); close(sock_fd); exit(1); } printf("client send data:%s\n", recvBuffer); memset(recvBuffer, '\0', 128); msg.msg_name = &toAddr; msg.msg_namelen = sizeof(toAddr); msg.msg_iov = &iov; msg.msg_iovlen = 1; msg.msg_iov->iov_base = recvBuffer; msg.msg_iov->iov_len = 128; msg.msg_control = 0; msg.msg_controllen = 0; msg.msg_flags = 0; if (recvmsg(sock_fd, &msg, 0) == -1) { perror("recvmsg() error\n"); close(sock_fd); exit(1); } printf("receive data from server:%s\n", recvBuffer); close(sock_fd); return 0; } /*************** rds_server.c *****************/ int main(void) { struct sockaddr_in fromAddr; int sock_fd; struct sockaddr_in serverAddr; unsigned int addrLen; char recvBuffer[128]; struct msghdr msg; struct iovec iov; sock_fd = socket(AF_RDS, SOCK_SEQPACKET, 0); if(sock_fd < 0) { perror("create socket error\n"); exit(0); } memset(&serverAddr, 0, sizeof(serverAddr)); serverAddr.sin_family = AF_INET; serverAddr.sin_addr.s_addr = inet_addr("127.0.0.1"); serverAddr.sin_port = htons(4000); if (bind(sock_fd, (struct sockaddr)&serverAddr, sizeof(serverAddr)) < 0) { perror("bind error\n"); close(sock_fd); exit(1); } printf("server is waiting to receive data...\n"); msg.msg_name = &fromAddr; /* * I add 16 to sizeof(fromAddr), ie 32, * and pay attention to the definition of fromAddr, * recvmsg() will overwrite sock_fd, * since kernel will copy 32 bytes to userspace. * * If you just use sizeof(fromAddr), it works fine. * / msg.msg_namelen = sizeof(fromAddr) + 16; / msg.msg_namelen = sizeof(fromAddr); */ msg.msg_iov = &iov; msg.msg_iovlen = 1; msg.msg_iov->iov_base = recvBuffer; msg.msg_iov->iov_len = 128; msg.msg_control = 0; msg.msg_controllen = 0; msg.msg_flags = 0; while (1) { printf("old socket fd=%d\n", sock_fd); if (recvmsg(sock_fd, &msg, 0) == -1) { perror("recvmsg() error\n"); close(sock_fd); exit(1); } printf("server received data from client:%s\n", recvBuffer); printf("msg.msg_namelen=%d\n", msg.msg_namelen); printf("new socket fd=%d\n", sock_fd); strcat(recvBuffer, "--data from server"); if (sendmsg(sock_fd, &msg, 0) == -1) { perror("sendmsg()\n"); close(sock_fd); exit(1); } } close(sock_fd); return 0; } Signed-off-by: Weiping Pan <wpan@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 01:01:44 -07:00
Dan Carpenter	5b3e7e6cb5	openvswitch: potential NULL deref in sample() If there is no OVS_SAMPLE_ATTR_ACTIONS set then "acts_list" is NULL and it leads to a NULL dereference when we call nla_len(acts_list). This is a static checker fix, not something I have seen in testing. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 00:59:54 -07:00
Eric Dumazet	563d34d057	tcp: dont drop MTU reduction indications ICMP messages generated in output path if frame length is bigger than mtu are actually lost because socket is owned by user (doing the xmit) One example is the ipgre_tunnel_xmit() calling icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu)); We had a similar case fixed in commit `a34a101e1e` (ipv6: disable GSO on sockets hitting dst_allfrag). Problem of such fix is that it relied on retransmit timers, so short tcp sessions paid a too big latency increase price. This patch uses the tcp_release_cb() infrastructure so that MTU reduction messages (ICMP messages) are not lost, and no extra delay is added in TCP transmits. Reported-by: Maciej Żenczykowski <maze@google.com> Diagnosed-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Tore Anderson <tore@fud.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 00:58:46 -07:00
Julian Anastasov	9a0a9502cb	tcp: avoid oops in tcp_metrics and reset tcpm_stamp In tcp_tw_remember_stamp we incorrectly checked tw instead of tm, it can lead to oops if the cached entry is not found. tcpm_stamp was not updated in tcpm_check_stamp when tcpm_suck_dst was called, move the update into tcpm_suck_dst, so that we do not call it infinitely on every next cache hit after TCP_METRICS_TIMEOUT. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 00:57:12 -07:00
Jesper Juhl	818810472b	net: Fix references to out-of-scope variables in put_cmsg_compat() In net/compat.c::put_cmsg_compat() we may assign 'data' the address of either the 'ctv' or 'cts' local variables inside the 'if (!COMPAT_USE_64BIT_TIME)' branch. Those variables go out of scope at the end of the 'if' statement, so when we use 'data' further down in 'copy_to_user(CMSG_COMPAT_DATA(cm), data, cmlen - sizeof(struct compat_cmsghdr))' there's no telling what it may be refering to - not good. Fix the problem by simply giving 'ctv' and 'cts' function scope. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-22 17:50:49 -07:00
David S. Miller	5e9965c15b	Merge branch 'kill_rtcache' The ipv4 routing cache is non-deterministic, performance wise, and is subject to reasonably easy to launch denial of service attacks. The routing cache works great for well behaved traffic, and the world was a much friendlier place when the tradeoffs that led to the routing cache's design were considered. What it boils down to is that the performance of the routing cache is a product of the traffic patterns seen by a system rather than being a product of the contents of the routing tables. The former of which is controllable by external entitites. Even for "well behaved" legitimate traffic, high volume sites can see hit rates in the routing cache of only ~%10. The general flow of this patch series is that first the routing cache is removed. We build a completely new rtable entry every lookup request. Next we make some simplifications due to the fact that removing the routing cache causes several members of struct rtable to become no longer necessary. Then we need to make some amends such that we can legally cache pre-constructed routes in the FIB nexthops. Firstly, we need to invalidate routes which are hit with nexthop exceptions. Secondly we have to change the semantics of rt->rt_gateway such that zero means that the destination is on-link and non-zero otherwise. Now that the preparations are ready, we start caching precomputed routes in the FIB nexthops. Output and input routes need different kinds of care when determining if we can legally do such caching or not. The details are in the commit log messages for those changes. The patch series then winds down with some more struct rtable simplifications and other tidy ups that remove unnecessary overhead. On a SPARC-T3 output route lookups are ~876 cycles. Input route lookups are ~1169 cycles with rpfilter disabled, and about ~1468 cycles with rpfilter enabled. These measurements were taken with the kbench_mod test module in the net_test_tools GIT tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net_test_tools.git That GIT tree also includes a udpflood tester tool and stresses route lookups on packet output. For example, on the same SPARC-T3 system we can run: time ./udpflood -l 10000000 10.2.2.11 with routing cache: real 1m21.955s user 0m6.530s sys 1m15.390s without routing cache: real 1m31.678s user 0m6.520s sys 1m25.140s Performance undoubtedly can easily be improved further. For example fib_table_lookup() performs a lot of excessive computations with all the masking and shifting, some of it conditionalized to deal with edge cases. Also, Eric's no-ref optimization for input route lookups can be re-instated for the FIB nexthop caching code path. I would be really pleased if someone would work on that. In fact anyone suitable motivated can just fire up perf on the loading of the test net_test_tools benchmark kernel module. I spend much of my time going: bash# perf record insmod ./kbench_mod.ko dst=172.30.42.22 src=74.128.0.1 iif=2 bash# perf report Thanks to helpful feedback from Joe Perches, Eric Dumazet, Ben Hutchings, and others. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-22 17:04:15 -07:00
Al Viro	6120d3dbb1	get rid of ->scm_work_list recursion in __scm_destroy() will be cut by delaying final fput() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-22 23:58:00 +04:00
John Fastabend	406a3c638c	net: netprio_cgroup: rework update socket logic Instead of updating the sk_cgrp_prioidx struct field on every send this only updates the field when a task is moved via cgroup infrastructure. This allows sockets that may be used by a kernel worker thread to be managed. For example in the iscsi case today a user can put iscsid in a netprio cgroup and control traffic will be sent with the correct sk_cgrp_prioidx value set but as soon as data is sent the kernel worker thread isssues a send and sk_cgrp_prioidx is updated with the kernel worker threads value which is the default case. It seems more correct to only update the field when the user explicitly sets it via control group infrastructure. This allows the users to manage sockets that may be used with other threads. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-22 12:44:01 -07:00
Michael S. Tsirkin	dcc0fb782b	skbuff: export skb_copy_ubufs Export skb_copy_ubufs so that modules can orphan frags. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-22 12:39:33 -07:00
Michael S. Tsirkin	1080e512d4	net: orphan frags on receive zero copy packets are normally sent to the outside network, but bridging, tun etc might loop them back to host networking stack. If this happens destructors will never be called, so orphan the frags immediately on receive. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-22 12:39:33 -07:00
Michael S. Tsirkin	70008aa50e	skbuff: convert to skb_orphan_frags Reduce code duplication a bit using the new helper. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-22 12:39:33 -07:00
Mark A. Greer	1d69c2b343	rtnl: Add #ifdef CONFIG_RPS around num_rx_queues reference Commit `76ff5cc919` (rtnl: allow to specify number of rx and tx queues on device creation) added a reference to the net_device structure's 'num_rx_queues' member in net/core/rtnetlink.c:rtnl_fill_ifinfo() However, the definition for 'num_rx_queues' is surrounded by an '#ifdef CONFIG_RPS' while the new reference to it is not. This causes a compile error when CONFIG_RPS is not defined. Fix the compile error by surrounding the new reference to 'num_rx_queues' by an '#ifdef CONFIG_RPS'. CC: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Mark A. Greer <mgreer@animalcreek.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-22 12:28:11 -07:00
Neil Horman	5aa93bcf66	sctp: Implement quick failover draft from tsvwg I've seen several attempts recently made to do quick failover of sctp transports by reducing various retransmit timers and counters. While its possible to implement a faster failover on multihomed sctp associations, its not particularly robust, in that it can lead to unneeded retransmits, as well as false connection failures due to intermittent latency on a network. Instead, lets implement the new ietf quick failover draft found here: http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05 This will let the sctp stack identify transports that have had a small number of errors, and avoid using them quickly until their reliability can be re-established. I've tested this out on two virt guests connected via multiple isolated virt networks and believe its in compliance with the above draft and works well. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Vlad Yasevich <vyasevich@gmail.com> CC: Sridhar Samudrala <sri@us.ibm.com> CC: "David S. Miller" <davem@davemloft.net> CC: linux-sctp@vger.kernel.org CC: joe@perches.com Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-22 12:13:46 -07:00
Kevin Groeneveld	e3906486f6	net: fix race condition in several drivers when reading stats Fix race condition in several network drivers when reading stats on 32bit UP architectures. These drivers update their stats in a BH context and therefore should use u64_stats_fetch_begin_bh/u64_stats_fetch_retry_bh instead of u64_stats_fetch_begin/u64_stats_fetch_retry when reading the stats. Signed-off-by: Kevin Groeneveld <kgroeneveld@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-22 12:12:32 -07:00
Eric Dumazet	0980e56e50	ipv4: tcp: set unicast_sock uc_ttl to -1 Set unicast_sock uc_ttl to -1 so that we select the right ttl, instead of sending packets with a 0 ttl. Bug added in commit `be9f4a44e7` (ipv4: tcp: remove per net tcp_sock) Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-22 12:06:21 -07:00
David S. Miller	c073cfc89f	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch Jesse Gross says: ==================== A few bug fixes and small enhancements for net-next/3.6. ... Ansis Atteka (1): openvswitch: Do not send notification if ovs_vport_set_options() failed Ben Pfaff (1): openvswitch: Check gso_type for correct sk_buff in queue_gso_packets(). Jesse Gross (2): openvswitch: Enable retrieval of TCP flags from IPv6 traffic. openvswitch: Reset upper layer protocol info on internal devices. Leo Alterman (1): openvswitch: Fix typo in documentation. Pravin B Shelar (1): openvswitch: Check currect return value from skb_gso_segment() Raju Subramanian (1): openvswitch: Replace Nicira Networks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 16:16:34 -07:00
Ben Pfaff	a1b5d0dd28	openvswitch: Check gso_type for correct sk_buff in queue_gso_packets(). At the point where it was used, skb_shinfo(skb)->gso_type referred to a post-GSO sk_buff. Thus, it would always be 0. We want to know the pre-GSO gso_type, so we need to obtain it before segmenting. Before this change, the kernel would pass inconsistent data to userspace: packets for UDP fragments with nonzero offset would be passed along with flow keys that indicate a zero offset (that is, the flow key for "later" fragments claimed to be "first" fragments). This inconsistency tended to confuse Open vSwitch userspace, causing it to log messages about "failed to flow_del" the flows with "later" fragments. Signed-off-by: Ben Pfaff <blp@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>	2012-07-20 14:47:54 -07:00
Pravin B Shelar	92e5dfc34c	openvswitch: Check currect return value from skb_gso_segment() Fix return check typo. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>	2012-07-20 14:46:29 -07:00
David S. Miller	2860583fe8	ipv4: Kill rt->fi It's not really needed. We only grabbed a reference to the fib_info for the sake of fib_info local metrics. However, fib_info objects are freed using RCU, as are therefore their private metrics (if any). We would have triggered a route cache flush if we eliminated a reference to a fib_info object in the routing tables. Therefore, any existing cached routes will first check and see that they have been invalidated before an errant reference to these metric values would occur. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:40:07 -07:00
David S. Miller	9917e1e876	ipv4: Turn rt->rt_route_iif into rt->rt_is_input. That is this value's only use, as a boolean to indicate whether a route is an input route or not. So implement it that way, using a u16 gap present in the struct already. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:40:02 -07:00
David S. Miller	4fd551d7be	ipv4: Kill rt->rt_oif Never actually used. It was being set on output routes to the original OIF specified in the flow key used for the lookup. Adjust the only user, ipmr_rt_fib_lookup(), for greater correctness of the flowi4_oif and flowi4_iif values, thanks to feedback from Julian Anastasov. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:38:34 -07:00
David S. Miller	93ac53410a	ipv4: Dirty less cache lines in route caching paths. Don't bother incrementing dst->__use and setting dst->lastuse, they are completely pointless and just slow things down. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:36:55 -07:00
David S. Miller	ba3f7f04ef	ipv4: Kill FLOWI_FLAG_RT_NOCACHE and associated code. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:36:54 -07:00
David S. Miller	d2d68ba9fe	ipv4: Cache input routes in fib_info nexthops. Caching input routes is slightly simpler than output routes, since we don't need to be concerned with nexthop exceptions. (locally destined, and routed packets, never trigger PMTU events or redirects that will be processed by us). However, we have to elide caching for the DIRECTSRC and non-zero itag cases. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:36:40 -07:00
David S. Miller	f2bb4bedf3	ipv4: Cache output routes in fib_info nexthops. If we have an output route that lacks nexthop exceptions, we can cache it in the FIB info nexthop. Such routes will have DST_HOST cleared because such routes refer to a family of destinations, rather than just one. The sequence of the handling of exceptions during route lookup is adjusted to make the logic work properly. Before we allocate the route, we lookup the exception. Then we know if we will cache this route or not, and therefore whether DST_HOST should be set on the allocated route. Then we use DST_HOST to key off whether we should store the resulting route, during rt_set_nexthop(), in the FIB nexthop cache. With help from Eric Dumazet. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:36:16 -07:00
David S. Miller	ceb3320610	ipv4: Kill routes during PMTU/redirect updates. Mark them obsolete so there will be a re-lookup to fetch the FIB nexthop exception info. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:31:22 -07:00
David S. Miller	f5b0a87436	net: Document dst->obsolete better. Add a big comment explaining how the field works, and use defines instead of magic constants for the values assigned to it. Suggested by Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:31:21 -07:00
David S. Miller	f8126f1d51	ipv4: Adjust semantics of rt->rt_gateway. In order to allow prefixed routes, we have to adjust how rt_gateway is set and interpreted. The new interpretation is: 1) rt_gateway == 0, destination is on-link, nexthop is iph->daddr 2) rt_gateway != 0, destination requires a nexthop gateway Abstract the fetching of the proper nexthop value using a new inline helper, rt_nexthop(), as suggested by Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>	2012-07-20 13:31:20 -07:00
David S. Miller	f1ce3062c5	ipv4: Remove 'rt_dst' from 'struct rtable' Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:31:19 -07:00
David Miller	b48698895d	ipv4: Remove 'rt_mark' from 'struct rtable' Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:31:18 -07:00
David Miller	d6c0a4f609	ipv4: Kill 'rt_src' from 'struct rtable' Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:31:00 -07:00
David Miller	1a00fee4ff	ipv4: Remove rt_key_{src,dst,tos} from struct rtable. They are always used in contexts where they can be reconstituted, or where the finally resolved rt->rt_{src,dst} is semantically equivalent. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:30:59 -07:00
David Miller	38a424e465	ipv4: Kill ip_route_input_noref(). The "noref" argument to ip_route_input_common() is now always ignored because we do not cache routes, and in that case we must always grab a reference to the resulting 'dst'. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:30:59 -07:00
David S. Miller	89aef8921b	ipv4: Delete routing cache. The ipv4 routing cache is non-deterministic, performance wise, and is subject to reasonably easy to launch denial of service attacks. The routing cache works great for well behaved traffic, and the world was a much friendlier place when the tradeoffs that led to the routing cache's design were considered. What it boils down to is that the performance of the routing cache is a product of the traffic patterns seen by a system rather than being a product of the contents of the routing tables. The former of which is controllable by external entitites. Even for "well behaved" legitimate traffic, high volume sites can see hit rates in the routing cache of only ~%10. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:30:27 -07:00
Mikulas Patocka	b09e786bd1	tun: fix a crash bug and a memory leak This patch fixes a crash tun_chr_close -> netdev_run_todo -> tun_free_netdev -> sk_release_kernel -> sock_release -> iput(SOCK_INODE(sock)) introduced by commit `1ab5ecb90c` The problem is that this socket is embedded in struct tun_struct, it has no inode, iput is called on invalid inode, which modifies invalid memory and optionally causes a crash. sock_release also decrements sockets_in_use, this causes a bug that "sockets: used" field in /proc/*/net/sockstat keeps on decreasing when creating and closing tun devices. This patch introduces a flag SOCK_EXTERNALLY_ALLOCATED that instructs sock_release to not free the inode and not decrement sockets_in_use, fixing both memory corruption and sockets_in_use underflow. It should be backported to 3.3 an 3.4 stabke. Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Cc: stable@kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 11:21:06 -07:00
Julian Anastasov	521f549097	ipv4: show pmtu in route list Override the metrics with rt_pmtu Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 11:16:49 -07:00
Jiri Pirko	76ff5cc919	rtnl: allow to specify number of rx and tx queues on device creation This patch introduces IFLA_NUM_TX_QUEUES and IFLA_NUM_RX_QUEUES by which userspace can set number of rx and/or tx queues to be allocated for newly created netdevice. This overrides ops->get_num_[tr]x_queues() Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 11:07:00 -07:00
Jiri Pirko	d40156aa5e	rtnl: allow to specify different num for rx and tx queue count Also cut out unused function parameters and possible err in return value. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 11:06:59 -07:00
Eric Dumazet	6f458dfb40	tcp: improve latencies of timer triggered events Modern TCP stack highly depends on tcp_write_timer() having a small latency, but current implementation doesn't exactly meet the expectations. When a timer fires but finds the socket is owned by the user, it rearms itself for an additional delay hoping next run will be more successful. tcp_write_timer() for example uses a 50ms delay for next try, and it defeats many attempts to get predictable TCP behavior in term of latencies. Use the recently introduced tcp_release_cb(), so that the user owning the socket will call various handlers right before socket release. This will permit us to post a followup patch to address the tcp_tso_should_defer() syndrome (some deferred packets have to wait RTO timer to be transmitted, while cwnd should allow us to send them sooner) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: John Heffner <johnwheffner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 10:59:41 -07:00
Eric Dumazet	9dc274151a	tcp: fix ABC in tcp_slow_start() When/if sysctl_tcp_abc > 1, we expect to increase cwnd by 2 if the received ACK acknowledges more than 2*MSS bytes, in tcp_slow_start() Problem is this RFC 3465 statement is not correctly coded, as the while () loop increases snd_cwnd one by one. Add a new variable to avoid this off-by one error. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: John Heffner <johnwheffner@gmail.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 10:59:41 -07:00
Eric Dumazet	5815d5e7aa	tcp: use hash_32() in tcp_metrics Fix a missing roundup_pow_of_two(), since tcpmhash_entries is not guaranteed to be a power of two. Uses hash_32() instead of custom hash. tcpmhash_entries should be an unsigned int. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 10:59:41 -07:00
Vijay Subramanian	67b95bd78f	tcp: Return bool instead of int where appropriate Applied to a set of static inline functions in tcp_input.c Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 10:59:41 -07:00
John W. Linville	90b90f60c4	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem	2012-07-20 12:30:48 -04:00
Linus Torvalds	85efc72a02	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull last minute Ceph fixes from Sage Weil: "The important one fixes a bug in the socket failure handling behavior that was turned up in some recent failure injection testing. The other two are minor bug fixes." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: rbd: endian bug in rbd_req_cb() rbd: Fix ceph_snap_context size calculation libceph: fix messenger retry	2012-07-19 16:11:28 -07:00
Julian Anastasov	f31fd38382	ipv4: Fix again the time difference calculation Fix again the diff value in rt_bind_exception after collision of two latest patches, my original commit actually fixed the same problem. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 13:01:44 -07:00
David S. Miller	abaa72d7fd	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c	2012-07-19 11:17:30 -07:00
Yuchung Cheng	67da22d23f	net-tcp: Fast Open client - cookie-less mode In trusted networks, e.g., intranet, data-center, the client does not need to use Fast Open cookie to mitigate DoS attacks. In cookie-less mode, sendmsg() with MSG_FASTOPEN flag will send SYN-data regardless of cookie availability. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 11:02:03 -07:00
Yuchung Cheng	aab4874355	net-tcp: Fast Open client - detecting SYN-data drops On paths with firewalls dropping SYN with data or experimental TCP options, Fast Open connections will have experience SYN timeout and bad performance. The solution is to track such incidents in the cookie cache and disables Fast Open temporarily. Since only the original SYN includes data and/or Fast Open option, the SYN-ACK has some tell-tale sign (tcp_rcv_fastopen_synack()) to detect such drops. If a path has recurring Fast Open SYN drops, Fast Open is disabled for 2^(recurring_losses) minutes starting from four minutes up to roughly one and half day. sendmsg with MSG_FASTOPEN flag will succeed but it behaves as connect() then write(). Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 11:02:03 -07:00
Yuchung Cheng	cf60af03ca	net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN) sendmsg() (or sendto()) with MSG_FASTOPEN is a combo of connect(2) and write(2). The application should replace connect() with it to send data in the opening SYN packet. For blocking socket, sendmsg() blocks until all the data are buffered locally and the handshake is completed like connect() call. It returns similar errno like connect() if the TCP handshake fails. For non-blocking socket, it returns the number of bytes queued (and transmitted in the SYN-data packet) if cookie is available. If cookie is not available, it transmits a data-less SYN packet with Fast Open cookie request option and returns -EINPROGRESS like connect(). Using MSG_FASTOPEN on connecting or connected socket will result in simlar errno like repeating connect() calls. Therefore the application should only use this flag on new sockets. The buffer size of sendmsg() is independent of the MSS of the connection. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 11:02:03 -07:00
Yuchung Cheng	8e4178c1c7	net-tcp: Fast Open client - receiving SYN-ACK On receiving the SYN-ACK after SYN-data, the client needs to a) update the cached MSS and cookie (if included in SYN-ACK) b) retransmit the data not yet acknowledged by the SYN-ACK in the final ACK of the handshake. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 11:02:03 -07:00
Yuchung Cheng	783237e8da	net-tcp: Fast Open client - sending SYN-data This patch implements sending SYN-data in tcp_connect(). The data is from tcp_sendmsg() with flag MSG_FASTOPEN (implemented in a later patch). The length of the cookie in tcp_fastopen_req, init'd to 0, controls the type of the SYN. If the cookie is not cached (len==0), the host sends data-less SYN with Fast Open cookie request option to solicit a cookie from the remote. If cookie is not available (len > 0), the host sends a SYN-data with Fast Open cookie option. If cookie length is negative, the SYN will not include any Fast Open option (for fall back operations). To deal with middleboxes that may drop SYN with data or experimental TCP option, the SYN-data is only sent once. SYN retransmits do not include data or Fast Open options. The connection will fall back to regular TCP handshake. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 11:02:03 -07:00
Yuchung Cheng	1fe4c481ba	net-tcp: Fast Open client - cookie cache With help from Eric Dumazet, add Fast Open metrics in tcp metrics cache. The basic ones are MSS and the cookies. Later patch will cache more to handle unfriendly middleboxes. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 10:55:36 -07:00
Yuchung Cheng	2100c8d2d9	net-tcp: Fast Open base This patch impelements the common code for both the client and server. 1. TCP Fast Open option processing. Since Fast Open does not have an option number assigned by IANA yet, it shares the experiment option code 254 by implementing draft-ietf-tcpm-experimental-options with a 16 bits magic number 0xF989. This enables global experiments without clashing the scarce(2) experimental options available for TCP. When the draft status becomes standard (maybe), the client should switch to the new option number assigned while the server supports both numbers for transistion. 2. The new sysctl tcp_fastopen 3. A place holder init function Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 10:55:36 -07:00
stephen hemminger	83bd1b793e	ipx: move peII functions The Ethernet II wrapper is only used by IPX protocol, may have once been used by Appletalk but not currently. Therefore it makes sense to move it to the IPX dust bin and drop the exports. Build tested only. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 10:48:00 -07:00
Eric Dumazet	be9f4a44e7	ipv4: tcp: remove per net tcp_sock tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket per network namespace. This leads to bad behavior on multiqueue NICS, because many cpus contend for the socket lock and once socket lock is acquired, extra false sharing on various socket fields slow down the operations. To better resist to attacks, we use a percpu socket. Each cpu can run without contention, using appropriate memory (local node) Additional features : 1) We also mirror the queue_mapping of the incoming skb, so that answers use the same queue if possible. 2) Setting SOCK_USE_WRITE_QUEUE socket flag speedup sock_wfree() 3) We now limit the number of in-flight RST/ACK [1] packets per cpu, instead of per namespace, and we honor the sysctl_wmem_default limit dynamically. (Prior to this patch, sysctl_wmem_default value was copied at boot time, so any further change would not affect tcp_sock limit) [1] These packets are only generated when no socket was matched for the incoming packet. Reported-by: Bill Sommerfeld <wsommerfeld@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 10:35:30 -07:00
Julian Anastasov	aee06da672	ipv4: use seqlock for nh_exceptions Use global seqlock for the nh_exceptions. Call fnhe_oldest with the right hash chain. Correct the diff value for dst_set_expires. v2: after suggestions from Eric Dumazet: * get rid of spin lock fnhe_lock, rearrange update_or_create_fnhe * continue daddr search in rt_bind_exception v3: * remove the daddr check before seqlock in rt_bind_exception * restart lookup in rt_bind_exception on detected seqlock change, as suggested by David Miller Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 10:30:14 -07:00
John W. Linville	3e497e0215	Merge branch 'for-john' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next	2012-07-19 12:35:00 -04:00
David S. Miller	7fed84f622	ipv4: Fix time difference calculation in rt_bind_exception(). Reported-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 08:46:59 -07:00
Julian Anastasov	0cc535a299	ipv4: fix address selection in fib_compute_spec_dst ip_options_compile can be called for forwarded packets, make sure the specific-destionation address is a local one as specified in RFC 1812, 4.2.2.2 Addresses in Options Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 08:30:49 -07:00
Julian Anastasov	6255e5ead0	ipv4: optimize fib_compute_spec_dst call in ip_options_echo Move fib_compute_spec_dst at the only place where it is needed. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 08:30:49 -07:00
Rustad, Mark D	734b65417b	net: Statically initialize init_net.dev_base_head This change eliminates an initialization-order hazard most recently seen when netprio_cgroup is built into the kernel. With thanks to Eric Dumazet for catching a bug. Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-18 13:32:27 -07:00
John W. Linville	0cd06647b7	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next	2012-07-18 14:53:10 -04:00
Eric Dumazet	ddbe503203	ipv6: add ipv6_addr_hash() helper Introduce ipv6_addr_hash() helper doing a XOR on all bits of an IPv6 address, with an optimized x86_64 version. Use it in flow dissector, as suggested by Andrew McGregor, to reduce hash collision probabilities in fq_codel (and other users of flow dissector) Use it in ip6_tunnel.c and use more bit shuffling, as suggested by David Laight, as existing hash was ignoring most of them. Use it in sunrpc and use more bit shuffling, using hash_32(). Use it in net/ipv6/addrconf.c, using hash_32() as well. As a cleanup, use it in net/ipv4/tcp_metrics.c Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrew McGregor <andrewmcgr@gmail.com> Cc: Dave Taht <dave.taht@gmail.com> Cc: Tom Herbert <therbert@google.com> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-18 11:28:46 -07:00
Krishna Kumar	02756ed4a7	skbuff: Use correct allocation in skb_copy_ubufs Use correct allocation flags during copy of user space fragments to the kernel. Also "improve" couple of for loops. Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-18 09:40:54 -07:00
Saurabh	1181412c1a	net/ipv4: VTI support new module for ip_vti. New VTI tunnel kernel module, Kconfig and Makefile changes. Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com> Reviewed-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-18 09:36:12 -07:00
Saurabh	eb8637cd4a	net/ipv4: VTI support rx-path hook in xfrm4_mode_tunnel. Incorporated David and Steffen's comments. Add hook for rx-path xfmr4_mode_tunnel for VTI tunnel module. Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com> Reviewed-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-18 09:36:12 -07:00
Eric Dumazet	e371589917	tcp: refine SYN handling in tcp_validate_incoming Followup of commit `0c24604b68` (tcp: implement RFC 5961 4.2) As reported by Vijay Subramanian, we should send a challenge ACK instead of a dup ack if a SYN flag is set on a packet received out of window. This permits the ratelimiting to work as intended, and to increase correct SNMP counters. Suggested-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Vijay Subramanian <subramanian.vijay@gmail.com> Cc: Kiran Kumar Kella <kkiran@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-18 09:31:25 -07:00
Paul Moore	89d7ae34cd	cipso: don't follow a NULL pointer when setsockopt() is called As reported by Alan Cox, and verified by Lin Ming, when a user attempts to add a CIPSO option to a socket using the CIPSO_V4_TAG_LOCAL tag the kernel dies a terrible death when it attempts to follow a NULL pointer (the skb argument to cipso_v4_validate() is NULL when called via the setsockopt() syscall). This patch fixes this by first checking to ensure that the skb is non-NULL before using it to find the incoming network interface. In the unlikely case where the skb is NULL and the user attempts to add a CIPSO option with the _TAG_LOCAL tag we return an error as this is not something we want to allow. A simple reproducer, kindly supplied by Lin Ming, although you must have the CIPSO DOI #3 configure on the system first or you will be caught early in cipso_v4_validate(): #include <sys/types.h> #include <sys/socket.h> #include <linux/ip.h> #include <linux/in.h> #include <string.h> struct local_tag { char type; char length; char info[4]; }; struct cipso { char type; char length; char doi[4]; struct local_tag local; }; int main(int argc, char **argv) { int sockfd; struct cipso cipso = { .type = IPOPT_CIPSO, .length = sizeof(struct cipso), .local = { .type = 128, .length = sizeof(struct local_tag), }, }; memset(cipso.doi, 0, 4); cipso.doi[3] = 3; sockfd = socket(AF_INET, SOCK_DGRAM, 0); #define SOL_IP 0 setsockopt(sockfd, SOL_IP, IP_OPTIONS, &cipso, sizeof(struct cipso)); return 0; } CC: Lin Ming <mlin@ss.pku.edu.cn> Reported-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-18 09:01:12 -07:00
Eric Dumazet	d3818c92af	ipv6: fix inet6_csk_xmit() We should provide to inet6_csk_route_socket a struct flowi6 pointer, so that net6_csk_xmit() works correctly instead of sending garbage. Also add some consts Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-18 08:59:58 -07:00
Eliad Peller	99102bd380	mac80211: flush stations before stop beaconing When AP interface is going down, the stations are flushed (in ieee80211_do_stop()) only after the beaconing was stopped. However, drivers might rely on stations being removed before the beaconing was stopped, in order to clean up properly. Fix it by flushing the stations on ap stop. (we already do the same for other interface types, e.g. in ieee80211_set_disassoc()) Signed-off-by: Eliad Peller <eliad@wizery.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-18 17:03:51 +02:00
Mohammed Shafi Shajakhan	ebd0fd2b1a	cfg80211: Fix mutex locking in reg_last_request_cell_base should fix the following issue [ 3229.815012] [ BUG: lock held when returning to user space! ] [ 3229.815016] 3.5.0-rc7-wl #28 Tainted: G W O [ 3229.815017] ------------------------------------------------ [ 3229.815019] wpa_supplicant/5783 is leaving the kernel with locks still held! [ 3229.815022] 1 lock held by wpa_supplicant/5783: [ 3229.815023] #0: (reg_mutex){+.+.+.}, at: [<fa65834d>] reg_last_request_cell_base+0x1d/0x60 [cfg80211] Cc: Luis Rodriguez <mcgrof@gmail.com> Signed-off-by: Mohammed Shafi Shajakhan <mohammed@qca.qualcomm.com> Tested-by: Luciano Coelho <coelho@ti.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-18 17:03:15 +02:00
Sage Weil	5bdca4e076	libceph: fix messenger retry In ancient times, the messenger could both initiate and accept connections. An artifact if that was data structures to store/process an incoming ceph_msg_connect request and send an outgoing ceph_msg_connect_reply. Sadly, the negotiation code was referencing those structures and ignoring important information (like the peer's connect_seq) from the correct ones. Among other things, this fixes tight reconnect loops where the server sends RETRY_SESSION and we (the client) retries with the same connect_seq as last time. This bug pretty easily triggered by injecting socket failures on the MDS and running some fs workload like workunits/direct_io/test_sync_io. Signed-off-by: Sage Weil <sage@inktank.com>	2012-07-17 19:35:59 -07:00
Trond Myklebust	013448c59b	SUNRPC: Add a missing spin_unlock to gss_mech_list_pseudoflavors The patch "SUNRPC: Add rpcauth_list_flavors()" introduces a new error path in gss_mech_list_pseudoflavors, but fails to release the spin lock. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-07-17 17:02:57 -04:00
Eric Dumazet	5abf7f7e0f	ipv4: fix rcu splat free_nh_exceptions() should use rcu_dereference_protected(..., 1) since its called after one RCU grace period. Also add some const-ification in recent code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 13:47:33 -07:00
David S. Miller	d3a25c980f	ipv4: Fix nexthop exception hash computation. Need to mask it with (FNHE_HASH_SIZE - 1). Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 13:23:08 -07:00
John W. Linville	d369f7b2b2	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless	2012-07-17 15:31:33 -04:00
John W. Linville	707be0ae13	Merge branch 'for-john' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next	2012-07-17 15:07:31 -04:00
David S. Miller	a6ff1a2f1e	Merge branch 'nexthop_exceptions' These patches implement the final mechanism necessary to really allow us to go without the route cache in ipv4. We need a place to have long-term storage of PMTU/redirect information which is independent of the routes themselves, yet does not get us back into a situation where we have to write to metrics or anything like that. For this we use an "next-hop exception" table in the FIB nexthops. The one thing I desperately want to avoid is having to create clone routes in the FIB trie for this purpose, because that is very expensive. However, I'm willing to entertain such an idea later if this current scheme proves to have downsides that the FIB trie variant would not have. In order to accomodate this any such scheme, we need to be able to produce a full flow key at PMTU/redirect time. That required an adjustment of the interface call-sites used to propagate these events. For a PMTU/redirect with a fully specified socket, we pass that socket and use it to produce the flow key. Otherwise we use a passed in SKB to formulate the key. There are two cases that need to be distinguished, ICMP message processing (in which case the IP header is at skb->data) and output packet processing (mostly tunnels, and in all such cases the IP header is at ip_hdr(skb)). We also have to make the code able to handle the case where the dst itself passed into the dst_ops->{update_pmtu,redirect} method is invalidated. This matters for calls from sockets that have cached that route. We provide a inet{,6} helper function for this purpose, and edit SCTP specially since it caches routes at the transport rather than socket level. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 10:48:26 -07:00
Jiri Pirko	30fdd8a082	netpoll: move np->dev and np->dev_name init into __netpoll_setup() Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 09:02:36 -07:00
David S. Miller	4895c771c7	ipv4: Add FIB nexthop exceptions. In a regime where we have subnetted route entries, we need a way to store persistent storage about destination specific learned values such as redirects and PMTU values. This is implemented here via nexthop exceptions. The initial implementation is a 2048 entry hash table with relaiming starting at chain length 5. A more sophisticated scheme can be devised if that proves necessary. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 08:48:50 -07:00
Eric Dumazet	0c24604b68	tcp: implement RFC 5961 4.2 Implement the RFC 5691 mitigation against Blind Reset attack using SYN bit. Section 4.2 of RFC 5961 advises to send a Challenge ACK and drop incoming packet, instead of resetting the session. Add a new SNMP counter to count number of challenge acks sent in response to SYN packets. (netstat -s \| grep TCPSYNChallenge) Remove obsolete TCPAbortOnSyn, since we no longer abort a TCP session because of a SYN flag. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kiran Kumar Kella <kkiran@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 07:40:46 -07:00
David S. Miller	6700c2709c	net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}() This will be used so that we can compose a full flow key. Even though we have a route in this context, we need more. In the future the routes will be without destination address, source address, etc. keying. One ipv4 route will cover entire subnets, etc. In this environment we have to have a way to possess persistent storage for redirects and PMTU information. This persistent storage will exist in the FIB tables, and that's why we'll need to be able to rebuild a full lookup flow key here. Using that flow key will do a fib_lookup() and create/update the persistent entry. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 03:29:28 -07:00
David S. Miller	602e65a3b0	Merge branch 'master' of git://1984.lsi.us.es/nf Pablo Neira Ayuso says: ==================== I know that we're in fairly late stage to request pulls, but the IPVS people pinged me with little patches with oops fixes last week. One of them was recently introduced (during the 3.4 development cycle) while cleaning up the IPVS netns support. They are: * Fix one regression introduced in 3.4 while cleaning up the netns support for IPVS, from Julian Anastasov. * Fix one oops triggered due to resetting the conntrack attached to the skb instead of just putting it in the forward hook, from Lin Ming. This problem seems to be there since 2.6.37 according to Simon Horman. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 03:19:33 -07:00
Eliad Peller	88bc40e8c3	mac80211: go out of PS before sending disassoc on disassoc, ieee80211_set_disassoc() goes out of PS before indicating BSS_CHANGED_ASSOC (not sure why this is needed, but some drivers might count on the current behavior). However, it does it after sending the disassoc frame, which results in null-data frame being sent (in order to go out of ps) after we were already sent the disassoc, which is invalid. Fix it by going out of ps before sending the disassoc. Signed-off-by: Eliad Peller <eliad@wizery.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-17 12:17:42 +02:00
Luis R. Rodriguez	14cdf11201	cfg80211: remove regulatory_update() regulatory_update() just calls wiphy_update_regulatory(). wiphy_update_regulatory() assumes you already have the reg_mutex held so just move the call within locking context and kill the superfluous regulatory_update(). Signed-off-by: Luis R. Rodriguez <mcgrof@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-17 12:16:41 +02:00
Luis R. Rodriguez	f8a1c77457	cfg80211: make regulatory_update() static Now that we have wiphy_regulatory_register() we can tuck away the core's regulatory_update() call there and make it static. Signed-off-by: Luis R. Rodriguez <mcgrof@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-17 12:16:40 +02:00
Luis R. Rodriguez	bfead0808c	cfg80211: rename reg_device_remove() to wiphy_regulatory_deregister() This makes it clearer what we're doing. This now makes a bit more sense given that regardless of the wiphy if the cell base station hint feature is supported we will be modifying the way the regulatory core behaves. Signed-off-by: Luis R. Rodriguez <mcgrof@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-17 12:16:39 +02:00
Luis R. Rodriguez	57b5ce072e	cfg80211: add cellular base station regulatory hint support Cellular base stations can provide hints to cfg80211 about where they think we are. This can be done for example on a cell phone. To enable these hints we simply allow them through as user regulatory hints but we allow userspace to clasify the hint as either coming directly from the user or coming from a cellular base station. This option is only available when you enable CONFIG_CFG80211_CERTIFICATION_ONUS. The base station hints themselves will not be processed by the core unless at least one device on the system supports this feature. Signed-off-by: Luis R. Rodriguez <mcgrof@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-17 12:16:39 +02:00
Luis R. Rodriguez	b594bab902	cfg80211: add CONFIG_CFG80211_CERTIFICATION_ONUS This adds CONFIG_CFG80211_CERTIFICATION_ONUS which is to be used for features / code which require a bit of work on the system integrator's part to ensure that the system will still pass 802.11 regulatory certification. This option is also usable for researchers and experimenters looking to add code in the kernel without impacting compliant code. We'd use CONFIG_EXPERT alone but it seems that most standard Linux distributions are enabling CONFIG_EXPERT already. This allows us to define 802.11 specific kernel features under a flag that is intended by design to be disabled by standard Linux distributions, and only enabled by system integrators or distributions that have done work to ensure regulatory certification on the system with the enabled features. Signed-off-by: Luis R. Rodriguez <mcgrof@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-17 12:13:51 +02:00
Julian Anastasov	283283c4da	ipvs: fix oops in ip_vs_dst_event on rmmod After commit `39f618b4fd` (3.4) "ipvs: reset ipvs pointer in netns" we can oops in ip_vs_dst_event on rmmod ip_vs because ip_vs_control_cleanup is called after the ipvs_core_ops subsys is unregistered and net->ipvs is NULL. Fix it by exiting early from ip_vs_dst_event if ipvs is NULL. It is safe because all services and dests for the net are already freed. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-07-17 12:00:58 +02:00
Kalle Valo	959085352b	cfg80211: fix set_regdom() to cancel requests with same alpha2 While adding regulatory support to ath6kl I noticed that I easily got the regulatory code confused. The way to reproduce the bug was: 1. iw reg set FI (in userspace) 2. cfg80211 calls ath6kl_reg_notify(FI) 3. ath6kl sets regdomain in firmware 4. firmware sends regdomain event to notify about the new regdomain (FI) 5. ath6kl calls regulatory_hint(FI) And this (from FI to FI transition) confuses cfg80211 and after that I only get "Pending regulatory request, waiting for it to be processed...." messages and regdomain changes won't work anymore. The reason why ath6kl calls regulatory_hint() is that firmware can change the regulatory domain by it's own, for example due to 11d IEs. I could of course workaround this in ath6kl but I think it's better to handle the case in cfg80211. The fix is pretty simple, use a different error code if the regdomain is same and then just set the request processed so that it doesn't block new requests. Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-17 12:00:43 +02:00
Thomas Pedersen	84f10708f7	cfg80211: support TX error rate CQM Let the user configure serveral TX error conection quality monitoring parameters: % error rate, survey interval, and # of attempted packets. On exceeding the TX failure rate over the given interval, the driver will send a CQM notify event with the actual TX failure rate and packets attempted. Signed-off-by: Thomas Pedersen <c_tpeder@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-17 11:57:23 +02:00
Johannes Berg	00f5335079	nl80211: add wdev ID as u64 as it should In one of my previous patches I erroneously used nla_put_u32 for the wdev_id, fix that to use nla_put_u64. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-17 11:53:57 +02:00
Nicolas Cavallari	7f9f78ab96	mac80211: fix tx-mgmt cookie value being left uninitialized commit "mac80211: unify SW/offload remain-on-channel" moved the cookie assignment from ieee80211_mgmt_tx() to ieee80211_start_roc_work(). But the latter is only called where offchannel is needed. If offchannel isn't needed/used, a uninitialized cookie value would be returned to userspace. This patch sets the cookie value when offchannel isn't used. Signed-off-by: Nicolas Cavallari <cavallar@lri.fr> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-17 11:22:42 +02:00
Eric Dumazet	282f23c6ee	tcp: implement RFC 5961 3.2 Implement the RFC 5691 mitigation against Blind Reset attack using RST bit. Idea is to validate incoming RST sequence, to match RCV.NXT value, instead of previouly accepted window : (RCV.NXT <= SEG.SEQ < RCV.NXT+RCV.WND) If sequence is in window but not an exact match, send a "challenge ACK", so that the other part can resend an RST with the appropriate sequence. Add a new sysctl, tcp_challenge_ack_limit, to limit number of challenge ACK sent per second. Add a new SNMP counter to count number of challenge acks sent. (netstat -s \| grep TCPChallengeACK) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kiran Kumar Kella <kkiran@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 01:36:20 -07:00
Li Wei	a858d64b77	ipv6: fix unappropriate errno returned for non-multicast address We need to check the passed in multicast address and return appropriate errno(EINVAL) if it is not valid. And it's no need to walk through the ipv6_mc_list in this situation. Signed-off-by: Li Wei <lw@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 01:35:03 -07:00
Masanari Iida	ad8c94532a	irda: Fix typo in irda Correct spelling typo in irda. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 23:23:52 -07:00
Ioan Orghici	db28aafad9	sctp: fix sparse warning for sctp_init_cause_fixed Fix the following sparse warning: * symbol 'sctp_init_cause_fixed' was not declared. Should it be static? Signed-off-by: Ioan Orghici <ioanorghici@gmail.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 23:23:52 -07:00
Alan Cox	ef764a13b8	ax25: Fix missing break At least there seems to be no reason to disallow ROSE sockets when NETROM is loaded. Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 23:22:36 -07:00
Eric Dumazet	5a308f40bf	netem: refine early skb orphaning netem does an early orphaning of skbs. Doing so breaks TCP Small Queue or any mechanism relying on socket sk_wmem_alloc feedback. Ideally, we should perform this orphaning after the rate module and before the delay module, to mimic what happens on a real link : skb orphaning is indeed normally done at TX completion, before the transit on the link. +-------+ +--------+ +---------------+ +-----------------+ + Qdisc +---> Device +--> TX completion +--> links / hops +-> + + + xmit + + skb orphaning + + propagation + +-------+ +--------+ +---------------+ +-----------------+ < rate limiting > < delay, drops, reorders > If netem is used without delay feature (drops, reorders, rate limiting), then we should avoid early skb orphaning, to keep pressure on sockets as long as packets are still in qdisc queue. Ideally, netem should be refactored to implement delay module as the last stage. Current algorithm merges the two phases (rate limiting + delay) so its not correct. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Cc: Mark Gordon <msg@google.com> Cc: Andreas Terzis <aterzis@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 23:08:33 -07:00
Sjur Brændeland	96f80d123e	caif: Fix access to freed pernet memory unregister_netdevice_notifier() must be called before unregister_pernet_subsys() to avoid accessing already freed pernet memory. This fixes the following oops when doing rmmod: Call Trace: [<ffffffffa0f802bd>] caif_device_notify+0x4d/0x5a0 [caif] [<ffffffff81552ba9>] unregister_netdevice_notifier+0xb9/0x100 [<ffffffffa0f86dcc>] caif_device_exit+0x1c/0x250 [caif] [<ffffffff810e7734>] sys_delete_module+0x1a4/0x300 [<ffffffff810da82d>] ? trace_hardirqs_on_caller+0x15d/0x1e0 [<ffffffff813517de>] ? trace_hardirqs_on_thunk+0x3a/0x3 [<ffffffff81696bad>] system_call_fastpath+0x1a/0x1f RIP [<ffffffffa0f7f561>] caif_get+0x51/0xb0 [caif] Signed-off-by: Sjur Brændeland <sjur.brandeland@stericsson.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 23:06:20 -07:00
Gao feng	ef209f1598	net: cgroup: fix access the unallocated memory in netprio cgroup there are some out of bound accesses in netprio cgroup. now before accessing the dev->priomap.priomap array,we only check if the dev->priomap exist.and because we don't want to see additional bound checkings in fast path, so we should make sure that dev->priomap is null or array size of dev->priomap.priomap is equal to max_prioidx + 1; so in write_priomap logic,we should call extend_netdev_table when dev->priomap is null and dev->priomap.priomap_len < max_len. and in cgrp_create->update_netdev_tables logic,we should call extend_netdev_table only when dev->priomap exist and dev->priomap.priomap_len < max_len. and it's not needed to call update_netdev_tables in write_priomap, we can only allocate the net device's priomap which we change through net_prio.ifpriomap. this patch also add a return value for update_netdev_tables & extend_netdev_table, so when new_priomap is allocated failed, write_priomap will stop to access the priomap,and return -ENOMEM back to the userspace to tell the user what happend. Change From v3: 1. add rtnl protect when reading max_prioidx in write_priomap. 2. only call extend_netdev_table when map->priomap_len < max_len, this will make sure array size of dev->map->priomap always bigger than any prioidx. 3. add a function write_update_netdev_table to make codes clear. Change From v2: 1. protect extend_netdev_table by RTNL. 2. when extend_netdev_table failed,call dev_put to reduce device's refcount. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 23:00:43 -07:00
Thomas Graf	036be6dbcf	bridge: Fix enforcement of multicast hash_max limit The hash size is doubled when it needs to grow and compared against hash_max. The >= comparison will limit the hash table size to half of what is expected i.e. the default 512 hash_max will not allow the hash table to grow larger than 256. Also print the hash table limit instead of the desirable size when the limit is reached. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 22:59:30 -07:00
Denis Ovsienko	f0396f60d7	ipv6: fix RTPROT_RA markup of RA routes w/nexthops Userspace implementations of network routing protocols sometimes need to tell RA-originated IPv6 routes from other kernel routes to make proper routing decisions. This makes most sense for RA routes with nexthops, namely, default routes and Route Information routes. The intended mean of preserving RA route origin in a netlink message is through indicating RTPROT_RA as protocol code. Function rt6_fill_node() tried to do that for default routes, but its test condition was taken wrong. This change is modeled after the original mailing list posting by Jeff Haran. It fixes the test condition for default route case and sets the same behaviour for Route Information case (both types use nexthops). Handling of the 3rd RA route type, Prefix Information, is left unchanged, as it stands for interface connected routes (without nexthops). Signed-off-by: Denis Ovsienko <infrastation@yandex.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 22:55:54 -07:00
Tony Cheneau	5e96855fc5	6lowpan: Change byte order when storing/accessing to len field Lenght field should be encoded using big endian byte order, such as intend in the specs. As it is currently written, the len field would not be decoded properly on an implementation using the correct byte ordering. Hence, it could lead to interroperability issues. Also, I rewrote the code so that iphc0 argument of lowpan_alloc_new_frame could be removed. Signed-off-by: Tony Cheneau <tony.cheneau@amnesiak.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 22:52:02 -07:00
Tony Cheneau	4576039ffc	6lowpan: Change byte order when storing/accessing u16 tag The tag field should be stored and accessed using big endian byte order (as intended in the specs). Or else, when displayed with a trafic analyser, such a Wireshark, the field not properly displayed (e.g. 0x01 00 instead of 0x00 01, and so on). Signed-off-by: Tony Cheneau <tony.cheneau@amnesiak.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 22:51:15 -07:00
Tony Cheneau	d4787a1543	6lowpan: Fix null pointer dereference in UDP uncompression function When a UDP packet gets fragmented, a crash will occur at reassembly time. This is because skb->transport_header is not set during earlier period of fragment reassembly. As a consequence, call to udp_hdr() return NULL and uh (which is NULL) gets dereferenced without much test. Signed-off-by: Tony Cheneau <tony.cheneau@amnesiak.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 22:51:15 -07:00
David S. Miller	07689b0a5c	Merge branch 'tipc_net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux Paul Gortmaker says: ==================== This is the same eight commits as sent for review last week[1], with just the incorporation of the pr_fmt change as suggested by JoeP. There was no additional change requests, so unless you can see something else you'd like me to change, please pull. ... Erik Hugne (5): tipc: use standard printk shortcut macros (pr_err etc.) tipc: remove TIPC packet debugging functions and macros tipc: simplify print buffer handling in tipc_printf tipc: phase out most of the struct print_buf usage tipc: remove print_buf and deprecated log buffer code Paul Gortmaker (3): tipc: factor stats struct out of the larger link struct tipc: limit error messages relating to memory leak to one line tipc: simplify link_print by divorcing it from using tipc_printf ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 22:35:04 -07:00
Neil Horman	2eebc1e188	sctp: Fix list corruption resulting from freeing an association on a list A few days ago Dave Jones reported this oops: [22766.294255] general protection fault: 0000 [#1] PREEMPT SMP [22766.295376] CPU 0 [22766.295384] Modules linked in: [22766.387137] ffffffffa169f292 6b6b6b6b6b6b6b6b ffff880147c03a90 ffff880147c03a74 [22766.387135] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000 [22766.387136] Process trinity-watchdo (pid: 10896, threadinfo ffff88013e7d2000, [22766.387137] Stack: [22766.387140] ffff880147c03a10 [22766.387140] ffffffffa169f2b6 [22766.387140] ffff88013ed95728 [22766.387143] 0000000000000002 [22766.387143] 0000000000000000 [22766.387143] ffff880003fad062 [22766.387144] ffff88013c120000 [22766.387144] [22766.387145] Call Trace: [22766.387145] <IRQ> [22766.387150] [<ffffffffa169f292>] ? __sctp_lookup_association+0x62/0xd0 [sctp] [22766.387154] [<ffffffffa169f2b6>] __sctp_lookup_association+0x86/0xd0 [sctp] [22766.387157] [<ffffffffa169f597>] sctp_rcv+0x207/0xbb0 [sctp] [22766.387161] [<ffffffff810d4da8>] ? trace_hardirqs_off_caller+0x28/0xd0 [22766.387163] [<ffffffff815827e3>] ? nf_hook_slow+0x133/0x210 [22766.387166] [<ffffffff815902fc>] ? ip_local_deliver_finish+0x4c/0x4c0 [22766.387168] [<ffffffff8159043d>] ip_local_deliver_finish+0x18d/0x4c0 [22766.387169] [<ffffffff815902fc>] ? ip_local_deliver_finish+0x4c/0x4c0 [22766.387171] [<ffffffff81590a07>] ip_local_deliver+0x47/0x80 [22766.387172] [<ffffffff8158fd80>] ip_rcv_finish+0x150/0x680 [22766.387174] [<ffffffff81590c54>] ip_rcv+0x214/0x320 [22766.387176] [<ffffffff81558c07>] __netif_receive_skb+0x7b7/0x910 [22766.387178] [<ffffffff8155856c>] ? __netif_receive_skb+0x11c/0x910 [22766.387180] [<ffffffff810d423e>] ? put_lock_stats.isra.25+0xe/0x40 [22766.387182] [<ffffffff81558f83>] netif_receive_skb+0x23/0x1f0 [22766.387183] [<ffffffff815596a9>] ? dev_gro_receive+0x139/0x440 [22766.387185] [<ffffffff81559280>] napi_skb_finish+0x70/0xa0 [22766.387187] [<ffffffff81559cb5>] napi_gro_receive+0xf5/0x130 [22766.387218] [<ffffffffa01c4679>] e1000_receive_skb+0x59/0x70 [e1000e] [22766.387242] [<ffffffffa01c5aab>] e1000_clean_rx_irq+0x28b/0x460 [e1000e] [22766.387266] [<ffffffffa01c9c18>] e1000e_poll+0x78/0x430 [e1000e] [22766.387268] [<ffffffff81559fea>] net_rx_action+0x1aa/0x3d0 [22766.387270] [<ffffffff810a495f>] ? account_system_vtime+0x10f/0x130 [22766.387273] [<ffffffff810734d0>] __do_softirq+0xe0/0x420 [22766.387275] [<ffffffff8169826c>] call_softirq+0x1c/0x30 [22766.387278] [<ffffffff8101db15>] do_softirq+0xd5/0x110 [22766.387279] [<ffffffff81073bc5>] irq_exit+0xd5/0xe0 [22766.387281] [<ffffffff81698b03>] do_IRQ+0x63/0xd0 [22766.387283] [<ffffffff8168ee2f>] common_interrupt+0x6f/0x6f [22766.387283] <EOI> [22766.387284] [22766.387285] [<ffffffff8168eed9>] ? retint_swapgs+0x13/0x1b [22766.387285] Code: c0 90 5d c3 66 0f 1f 44 00 00 4c 89 c8 5d c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89 5d e8 4c 89 65 f0 4c 89 6d f8 66 66 66 66 90 <0f> b7 87 98 00 00 00 48 89 fb 49 89 f5 66 c1 c0 08 66 39 46 02 [22766.387307] [22766.387307] RIP [22766.387311] [<ffffffffa168a2c9>] sctp_assoc_is_match+0x19/0x90 [sctp] [22766.387311] RSP <ffff880147c039b0> [22766.387142] ffffffffa16ab120 [22766.599537] ---[ end trace 3f6dae82e37b17f5 ]--- [22766.601221] Kernel panic - not syncing: Fatal exception in interrupt It appears from his analysis and some staring at the code that this is likely occuring because an association is getting freed while still on the sctp_assoc_hashtable. As a result, we get a gpf when traversing the hashtable while a freed node corrupts part of the list. Nominally I would think that an mibalanced refcount was responsible for this, but I can't seem to find any obvious imbalance. What I did note however was that the two places where we create an association using sctp_primitive_ASSOCIATE (__sctp_connect and sctp_sendmsg), have failure paths which free a newly created association after calling sctp_primitive_ASSOCIATE. sctp_primitive_ASSOCIATE brings us into the sctp_sf_do_prm_asoc path, which issues a SCTP_CMD_NEW_ASOC side effect, which in turn adds a new association to the aforementioned hash table. the sctp command interpreter that process side effects has not way to unwind previously processed commands, so freeing the association from the __sctp_connect or sctp_sendmsg error path would lead to a freed association remaining on this hash table. I've fixed this but modifying sctp_[un]hash_established to use hlist_del_init, which allows us to proerly use hlist_unhashed to check if the node is on a hashlist safely during a delete. That in turn alows us to safely call sctp_unhash_established in the __sctp_connect and sctp_sendmsg error paths before freeing them, regardles of what the associations state is on the hash list. I noted, while I was doing this, that the __sctp_unhash_endpoint was using hlist_unhsashed in a simmilar fashion, but never nullified any removed nodes pointers to make that function work properly, so I fixed that up in a simmilar fashion. I attempted to test this using a virtual guest running the SCTP_RR test from netperf in a loop while running the trinity fuzzer, both in a loop. I wasn't able to recreate the problem prior to this fix, nor was I able to trigger the failure after (neither of which I suppose is suprising). Given the trace above however, I think its likely that this is what we hit. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: davej@redhat.com CC: davej@redhat.com CC: "David S. Miller" <davem@davemloft.net> CC: Vlad Yasevich <vyasevich@gmail.com> CC: Sridhar Samudrala <sri@us.ibm.com> CC: linux-sctp@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 22:32:26 -07:00
Andrey Vagin	51d7cccf07	net: make sock diag per-namespace Before this patch sock_diag works for init_net only and dumps information about sockets from all namespaces. This patch expands sock_diag for all name-spaces. It creates a netlink kernel socket for each netns and filters data during dumping. v2: filter accoding with netns in all places remove an unused variable. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Pavel Emelyanov <xemul@parallels.com> CC: Eric Dumazet <eric.dumazet@gmail.com> Cc: linux-kernel@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Andrew Vagin <avagin@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 22:31:34 -07:00
Eric Dumazet	a6df1ae938	tcp: add OFO snmp counters Add three SNMP TCP counters, to better track TCP behavior at global stage (netstat -s), when packets are received Out Of Order (OFO) TCPOFOQueue : Number of packets queued in OFO queue TCPOFODrop : Number of packets meant to be queued in OFO but dropped because socket rcvbuf limit hit. TCPOFOMerge : Number of packets in OFO that were merged with other packets. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 22:12:00 -07:00
Chuck Lever	6a1a1e34dc	SUNRPC: Add rpcauth_list_flavors() The gss_mech_list_pseudoflavors() function provides a list of currently registered GSS pseudoflavors. This list does not include any non-GSS flavors that have been registered with the RPC client. nfs4_find_root_sec() currently adds these extra flavors by hand. Instead, nfs4_find_root_sec() should be looking at the set of flavors that have been explicitly registered via rpcauth_register(). And, other areas of code will soon need the same kind of list that contains all flavors the kernel currently knows about (see below). Rather than cloning the open-coded logic in nfs4_find_root_sec() to those new places, introduce a generic RPC function that generates a full list of registered auth flavors and pseudoflavors. A new rpc_authops method is added that lists a flavor's pseudoflavors, if it has any. I encountered an interesting module loader loop when I tried to get the RPC client to invoke gss_mech_list_pseudoflavors() by name. This patch is a pre-requisite for server trunking discovery, and a pre-requisite for fixing up the in-kernel mount client to do better automatic security flavor selection. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-07-16 15:12:15 -04:00
Trond Myklebust	8626e4a426	Merge commit '9249e17fe094d853d1ef7475dd559a2cc7e23d42' into nfs-for-3.6 Resolve conflicts with the VFS atomic open and sget changes. Conflicts: fs/nfs/nfs4proc.c	2012-07-16 12:01:42 -04:00
Johan Hedberg	83ce9a06b5	Bluetooth: Change page scan interval in fast connectable mode This patch is based on a user space (hciops) patch which never made it upstream but does make sense to include in the mgmt part of the kernel. (User space) commit message from Dmitriy Paliy: " Page scan interval in fast connectable mode is changed from 22.5 msec to 160 msec to perform less aggressive page scanning. This is done accordingly to controller vendor recommendation. Primary concern is that current parameters 22.5 interval, 11.25 window, and interleaved scanning occupy whole radio bandwidth. Changing interval to 160 msec should be sufficient for both speeding up connection establishment and leaving space for other activities, like inquiry scan, e.g. " Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>	2012-07-16 10:50:11 -03:00
Eric Dumazet	310e158cc3	net: respect GFP_DMA in __netdev_alloc_skb() Few drivers use GFP_DMA allocations, and netdev_alloc_frag() doesn't allocate pages in DMA zone. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 04:17:49 -07:00
David S. Miller	02f3d4ce9e	sctp: Adjust PMTU updates to accomodate route invalidation. This adjusts the call to dst_ops->update_pmtu() so that we can transparently handle the fact that, in the future, the dst itself can be invalidated by the PMTU update (when we have non-host routes cached in sockets). Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 03:57:14 -07:00
David S. Miller	35ad9b9cf7	ipv6: Add helper inet6_csk_update_pmtu(). This is the ipv6 version of inet_csk_update_pmtu(). Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 03:44:56 -07:00
David S. Miller	80d0a69fc5	ipv4: Add helper inet_csk_update_pmtu(). This abstracts away the call to dst_ops->update_pmtu() so that we can transparently handle the fact that, in the future, the dst itself can be invalidated by the PMTU update (when we have non-host routes cached in sockets). So we try to rebuild the socket cached route after the method invocation if necessary. This isn't used by SCTP because it needs to cache dsts per-transport, and thus will need it's own local version of this helper. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 03:28:06 -07:00
Mat Martineau	c20f8e35ca	Bluetooth: Use tx window from config response for ack timing This change addresses an L2CAP ERTM throughput problem when a remote device does not fully utilize the available transmit window. The L2CAP ERTM transmit window size determines the maximum number of unacked frames that may be outstanding at any time. It is configured separately for each direction of an ERTM connection. Each side sends a configuration request with a tx_win field indicating how many unacked frames it is capable of receiving before sending an ack. The configuration response's tx_win field shows how many frames the transmitter will actually send before waiting for an ack. It's important to trace both the actual transmit window (to check for validity of incoming frames) and the number of frames that the transmitter will send before waiting (to send acks at the appropriate time). Now there are separate tx_win and ack_win values. ack_win is updated based on configuration responses, and is used to determine when acks are sent. Signed-off-by: Mat Martineau <mathewm@codeaurora.org> Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>	2012-07-15 12:18:29 -03:00
Theodore Ts'o	7bf2357524	net: feed /dev/random with the MAC address when registering a device Cc: David Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org	2012-07-14 20:17:46 -04:00
Steffen Klassert	141e369de6	xfrm: Initialize the struct xfrm_dst behind the dst_enty field We start initializing the struct xfrm_dst at the first field behind the struct dst_enty. This is error prone because it might leave a new field uninitialized. So start initializing the struct xfrm_dst right behind the dst_entry. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-14 00:29:12 -07:00
Steffen Klassert	8104891b86	ipv6: Initialize the struct rt6_info behind the dst_enty field We start initializing the struct rt6_info at the first field behind the struct dst_enty. This is error prone because it might leave a new field uninitialized. So start initializing the struct rt6_info right behind the dst_entry. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-14 00:29:12 -07:00
David S. Miller	921a678cb6	Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next John Linville says: ==================== Several drivers see updates: mwifiex, ath9k, iwlwifi, brcmsmac, wlcore/wl12xx/wl18xx, and a handful of others. The bcma bus got a lot of attention from Hauke Mehrtens. The cfg80211 component gets a flurry of patches for multi-channel support, and the mac80211 component gets the first few VHT (11ac) and 60GHz (11ad) patches. This also includes the removal of the iwmc3200 drivers, since the hardware never became available to normal people. Additionally, the NFC subsystem gets a series of updates. According to Samuel, "Here are the interesting bits: - A better error management for the HCI stack. - An LLCP "late" binding implementation for a better NFC SAP usage. SAPs are now reserved only when there's a client for it. - Support for Sony RC-S360 (a.k.a. PaSoRi) pn533 based dongle. We can read and write NFC tags and also establish a p2p link with this dongle now. - A few LLCP fixes." Finally, this includes another pull of the fixes from the wireless tree in order to resolve some merge issues. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-13 23:02:28 -07:00
Erik Hugne	869dd4662f	tipc: remove print_buf and deprecated log buffer code The internal log buffer handling functions can now safely be removed since there is no code using it anymore. Requests to interact with the internal tipc log buffer over netlink (in config.c) will report 'obsolete command'. This represents the final removal of any references to a struct print_buf, and the removal of the struct itself. We also get rid of a TIPC specific Kconfig in the process. Finally, log.h is removed since it is not needed anymore. Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2012-07-13 19:34:43 -04:00
Erik Hugne	dc1aed37d1	tipc: phase out most of the struct print_buf usage The tipc_printf is renamed to tipc_snprintf, as the new name describes more what the function actually does. It is also changed to take a buffer and length parameter and return number of characters written to the buffer. All callers of this function that used to pass a print_buf are updated. Final removal of the struct print_buf itself will be done synchronously with the pending removal of the deprecated logging code that also was using it. Functions that build up a response message with a list of ports, nametable contents etc. are changed to return the number of characters written to the output buffer. This information was previously hidden in a field of the print_buf struct, and the number of chars written was fetched with a call to tipc_printbuf_validate. This function is removed since it is no longer referenced nor needed. A generic max size ULTRA_STRING_MAX_LEN is defined, named in keeping with the existing TIPC_TLV_ULTRA_STRING, and the various definitions in port, link and nametable code that largely duplicated this information are removed. This means that amount of link statistics that can be returned is now increased from 2k to 32k. The buffer overflow check is now done just before the reply message is passed over netlink or TIPC to a remote node and the message indicating a truncated buffer is changed to a less dramatic one (less CAPS), placed at the end of the message. Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2012-07-13 19:33:28 -04:00
Erik Hugne	e2dbd60134	tipc: simplify print buffer handling in tipc_printf tipc_printf was previously used both to construct debug traces and to append data to buffers that should be sent over netlink to the tipc-config application. A global print_buffer was used to format the string before it was copied to the actual output buffer. This could lead to concurrent access of the global print_buffer, which then had to be lock protected. This is simplified by changing tipc_printf to append data directly to the output buffer using vscnprintf. With the new implementation of tipc_printf, there is no longer any risk of concurrent access to the internal log buffer, so the lock (and the comments describing it) are no longer strictly necessary. However, there are still a few functions that do grab this lock before resizing/dumping the log buffer. We leave the lock, and these functions untouched since they will be removed with a subsequent commit that drops the deprecated log buffer handling code Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2012-07-13 19:28:28 -04:00
Paul Gortmaker	5deedde9fa	tipc: simplify link_print by divorcing it from using tipc_printf To pave the way for a pending cleanup of tipc_printf, and removal of struct print_buf entirely, we make that task simpler by converting link_print to issue its messages with standard printk infrastructure. [Original idea separated from a larger patch from Erik Hugne <erik.hugne@ericsson.com>] Cc: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2012-07-13 19:27:57 -04:00
Erik Hugne	568fc588fc	tipc: remove TIPC packet debugging functions and macros The link queue traces and packet level debug functions served a purpose during early development, but are now redundant since there are other, more capable tools available for debugging at the packet level. The TIPC_DEBUG Kconfig option is removed since it does not provide any extra debugging features anymore. This gets rid of a lot of tipc_printf usages, which will make the pending cleanup work of that function easier. Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2012-07-13 19:25:16 -04:00
Erik Hugne	2cf8aa19fe	tipc: use standard printk shortcut macros (pr_err etc.) All messages should go directly to the kernel log. The TIPC specific error, warning, info and debug trace macro's are removed and all references replaced with pr_err, pr_warn, pr_info and pr_debug. Commonly used sub-strings are explicitly declared as a const char to reduce .text size. Note that this means the debug messages (changed to pr_debug), are now enabled through dynamic debugging, instead of a TIPC specific Kconfig option (TIPC_DEBUG). The latter will be phased out completely Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> [PG: use pr_fmt as suggested by Joe Perches <joe@perches.com>] Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2012-07-13 19:24:44 -04:00
David S. Miller	85b91b0339	ipv4: Don't store a rule pointer in fib_result. We only use it to fetch the rule's tclassid, so just store the tclassid there instead. This also decreases the size of fib_result by a full 8 bytes on 64-bit. On 32-bits it's a wash. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-13 08:21:29 -07:00
Johannes Berg	4290cb4bf2	cfg80211: reduce monitor interface tracking Revert commit `b78e8ceac2` ("cfg80211: track monitor channel") and remove the set_monitor_enabled() callback. Due to the tracking happening in NETDEV_PRE_UP, it had introduced bugs because the monitor interface callback would be called before the device was started. It looks like there's no way to fix this, and using NETDEV_PRE_UP is broken anyway (since there's no NETDEV_UP_FAIL), so remove all that code, track interfaces in NETDEV_UP and also stop tracking the monitor channel in cfg80211. This mostly reverts to before the tracking, except that we keep the interface count tracking so that setting the monitor channel can be rejected properly. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-13 16:16:11 +02:00
Johannes Berg	5b7ccaf3fc	cfg80211/mac80211: re-add get_channel operation This essentially reverts commit `2e165b8184` but introduces the get_channel operation with a new wireless_dev argument so that you can retrieve the channel per interface. This is necessary as even though we can track all interface channels (except monitor) we can't track the channel type used. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-13 16:16:11 +02:00
Johannes Berg	075e08477d	Revert "mac80211: refactor virtual monitor code" This reverts commit `870d37fc22`. This code doesn't work as cfg80211 will call set_monitor_enabled at the wrong time and it doesn't seem to be possible to fix this. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-13 16:16:10 +02:00
Alan Cox	4b4b8229ae	mac80211: fix use after free roc is destroyed then roc->started is referenced. Keep a local cache. Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-13 16:15:54 +02:00
Eric Dumazet	d01cb20711	tcp: add LAST_ACK as a valid state for TSQ Socket state LAST_ACK should allow TSQ to send additional frames, or else we rely on incoming ACKS or timers to send them. Reported-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-13 05:48:36 -07:00
Alexander Duyck	540eb7bf0b	net: Update alloc frag to reduce get/put page usage and recycle pages This patch is meant to help improve performance by reducing the number of locked operations required to allocate a frag on x86 and other platforms. This is accomplished by using atomic_set operations on the page count instead of calling get_page and put_page. It is based on work originally provided by Eric Dumazet. In addition it also helps to reduce memory overhead when using TCP. This is done by recycling the page if the only holder of the frame is the netdev_alloc_frag call itself. This can occur when skb heads are stolen by either GRO or TCP and the driver providing the packets is using paged frags to store all of the data for the packets. Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-13 05:48:36 -07:00
Johannes Berg	ae33bd817a	nl80211: allow enabling WoWLAN without triggers It may be desirable to use WoWLAN without triggers to keep the connection alive to the AP while suspended. Allow this use by enabling WoWLAN without triggers if no triggers were requested. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2012-07-12 22:30:34 +02:00
John W. Linville	d07d152892	Merge branch 'for-john' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Conflicts: drivers/net/wireless/iwmc3200wifi/cfg80211.c drivers/net/wireless/mwifiex/cfg80211.c	2012-07-12 15:21:05 -04:00
Dave Jones	8a70e7f8f3	NFC: NCI module license 'unspecified' taints kernel Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2012-07-12 14:48:41 -04:00
Eric Lapuyade	81b3039557	NFC: Set target nfcid1 for all HCI reader A targets Without the discovered target nfcid1 and its length set properly, type 2 tags detection fails with the pn544 as it checks for them from pn544_hci_complete_target_discovered(). Signed-off-by: Eric Lapuyade <eric.lapuyade@intel.com> Reported-by: Mathias Jeppsson <mathias.jeppsson@sonymobile.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2012-07-12 14:48:41 -04:00
John W. Linville	38a0084063	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem	2012-07-12 13:44:50 -04:00
David S. Miller	391e5c22f5	ipv4: Remove tb_peers from fib_table. No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-12 09:39:28 -07:00
Alan Cox	7ac2908e4b	sch_sfb: Fix missing NULL check Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=44461 Signed-off-by: Alan Cox <alan@linux.intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-12 08:33:18 -07:00
NeilBrown	200724a707	SUNRPC/cache: fix reporting of expired cache entries in 'content' file. Entries that are in a sunrpc cache but are not valid should be reported with a leading '#' so they look like a comment. Commit `d202cce896` (sunrpc: never return expired entries in sunrpc_cache_lookup) broke this for expired entries. This particularly applies to entries that have been replaced by newer entries. sunrpc_cache_update sets the expiry of the replaced entry to '0', but it remains in the cache until the next 'cache_clean'. The result is that if you echo 0 2000000000 1 0 > /proc/net/rpc/auth.unix.gid/channel several times, then cat /proc/net/rpc/auth.unix.gid/content It will display multiple entries for the one uid, which is at least confusing: #uid cnt: gids... 0 1: 0 0 1: 0 0 1: 0 With this patch, expired entries are marked as comments so you get #uid cnt: gids... 0 1: 0 # 0 1: 0 # 0 1: 0 These expired entries will never be seen by cache_check() as they are always after a non-expired entry with the same key - so the extra check is only needed in c_show() Signed-off-by: NeilBrown <neilb@suse.de> -- It's not a big problem, but it had me confused for a while, so it could well confuse others. Thanks, NeilBrown Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2012-07-12 11:20:28 -04:00
David S. Miller	10b0a7732e	Merge branch 'for-davem' of git://gitorious.org/linux-can/linux-can-next	2012-07-12 08:18:45 -07:00
David S. Miller	f0a70e902f	ipv4: Put proper checks into icmp_socket_deliver(). All handler->err() routines expect that we've done a pskb_may_pull() test to make sure that IP header length + 8 bytes can be safely pulled. Reported-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-12 08:06:04 -07:00
Florian Westphal	6d4fa852a0	net: sched: add ipset ematch Can be used to match packets against netfilter ip sets created via ipset(8). skb->sk_iif is used as 'incoming interface', skb->dev is 'outgoing interface'. Since ipset is usually called from netfilter, the ematch initializes a fake xt_action_param, pulls the ip header into the linear area and also sets skb->data to the IP header (otherwise matching Layer 4 set types doesn't work). Tested-by: Mr Dash Four <mr.dash.four@googlemail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-12 07:54:46 -07:00
alex.bluesman.smirnov@gmail.com	33c34c5e93	6lowpan: rework fragment-deleting routine 6lowpan module starts collecting incomming frames and fragments right after lowpan_module_init() therefor it will be better to clean unfinished fragments in lowpan_cleanup_module() function instead of doing it when link goes down. Changed spinlocks type to prevent deadlock with expired timer event and removed unused one. Signed-off-by: Alexander Smirnov <alex.bluesman.smirnov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-12 07:54:46 -07:00

... 3 4 5 6 7 ...

24663 Commits