Commit Graph

442770 Commits

Author SHA1 Message Date
wenxiong@linux.vnet.ibm.com
0c0e63410a bnx2x: Adapter not recovery from EEH error injection
When injecting EEH error to bnx2x adapter, adapter couldn't be recovery
and caused recursive EEH errors. The patch fixes the issue.

Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-03 18:37:38 -07:00
Balakumaran Kannan
31f6f291b6 net: driver: smsc: set NOCARRIER flag in dev at driver initialization
As smsc driver supports carrier detection, it should unset NOCARRIER
flag only after carrier state determination. By default that flag
is off so driver should set it before starting auto-negotiation

Signed-off-by: Balakumaran <Balakumaran.Kannan@ap.sony.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-03 18:20:38 -07:00
Michal Kubecek
21ee543edc xfrm: fix race between netns cleanup and state expire notification
The xfrm_user module registers its pernet init/exit after xfrm
itself so that its net exit function xfrm_user_net_exit() is
executed before xfrm_net_exit() which calls xfrm_state_fini() to
cleanup the SA's (xfrm states). This opens a window between
zeroing net->xfrm.nlsk pointer and deleting all xfrm_state
instances which may access it (via the timer). If an xfrm state
expires in this window, xfrm_exp_state_notify() will pass null
pointer as socket to nlmsg_multicast().

As the notifications are called inside rcu_read_lock() block, it
is sufficient to retrieve the nlsk socket with rcu_dereference()
and check the it for null.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-03 16:07:44 -07:00
David S. Miller
1299b3c49b Merge branch 'cnic'
Michael Chan says:

====================
cnic fixes

Fix 2 sleeping function from invalid context bugs and 1 missing iscsi netlink
message bug.

v2: Fixed typo in rcu_dereference_protected() and tested with CONFIG_PROVE_RCU
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-02 23:17:03 -07:00
Michael Chan
5943691442 cnic: Fix missing ISCSI_KEVENT_IF_DOWN message
The iSCSI netlink message needs to be sent before the ulp_ops is cleared
as it is sent through a function pointer in the ulp_ops.  This bug
causes iscsid to not get the message when the bnx2i driver is unloaded.

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-02 23:16:41 -07:00
Michael Chan
437b8a26f9 cnic: Don't take cnic_dev_lock in cnic_alloc_uio_rings()
We are allocating memory with GFP_KERNEL under spinlock.  Since this is
the only call manipulating the cnic_udev_list and it is always under
rtnl_lock, cnic_dev_lock can be safely removed.

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-02 23:16:41 -07:00
Michael Chan
20f30c2d5e cnic: Don't take rcu_read_lock in cnic_rcv_netevent()
Because the called function, such as bnx2fc_indicate_netevent(), can sleep,
we cannot take rcu_lock().  To prevent the rcu protected ulp_ops from going
away, we use the cnic_lock mutex and set the ULP_F_CALL_PENDING flag.
The code already waits for ULP_F_CALL_PENDING flag to clear in
cnic_unregister_device().

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-02 23:16:41 -07:00
Christian Riesch
74f43922dc net: davinci_emac: Remove unwanted debug/error message
In commit cd11cf5053 I accidentally
added an error message. I used it for debugging and forgot to remove
it before submitting the patch.

Signed-off-by: Christian Riesch <christian.riesch@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-02 23:13:38 -07:00
Linus Torvalds
cae61ba37b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Unbreak zebra and other netlink apps, from Eric W Biederman.

 2) Some new qmi_wwan device IDs, from Aleksander Morgado.

 3) Fix info leak in DCB netlink handler of qlcnic driver, from Dan
    Carpenter.

 4) inet_getid() and ipv6_select_ident() do not generate monotonically
    increasing ID numbers, fix from Eric Dumazet.

 5) Fix memory leak in __sk_prepare_filter(), from Leon Yu.

 6) Netlink leftover bytes warning message is user triggerable, rate
    limit it.  From Michal Schmidt.

 7) Fix non-linear SKB panic in ipvs, from Peter Christensen.

 8) Congestion window undo needs to be performed even if only never
    retransmitted data is SACK'd, fix from Yuching Cheng.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (24 commits)
  net: filter: fix possible memory leak in __sk_prepare_filter()
  net: ec_bhf: Add runtime dependencies
  tcp: fix cwnd undo on DSACK in F-RTO
  netlink: Only check file credentials for implicit destinations
  ipheth: Add support for iPad 2 and iPad 3
  team: fix mtu setting
  net: fix inet_getid() and ipv6_select_ident() bugs
  net: qmi_wwan: interface #11 in Sierra Wireless MC73xx is not QMI
  net: qmi_wwan: add additional Sierra Wireless QMI devices
  bridge: Prevent insertion of FDB entry with disallowed vlan
  netlink: rate-limit leftover bytes warning and print process name
  bridge: notify user space after fdb update
  net: qmi_wwan: add Netgear AirCard 341U
  net: fix wrong mac_len calculation for vlans
  batman-adv: fix NULL pointer dereferences
  net/mlx4_core: Reset RoCE VF gids when guest driver goes down
  emac: aggregation of v1-2 PLB errors for IER register
  emac: add missing support of 10mbit in emac/rgmii
  can: only rename enabled led triggers when changing the netdev name
  ipvs: Fix panic due to non-linear skb
  ...
2014-06-02 18:16:41 -07:00
Leon Yu
418c96ac15 net: filter: fix possible memory leak in __sk_prepare_filter()
__sk_prepare_filter() was reworked in commit bd4cf0ed3 (net: filter:
rework/optimize internal BPF interpreter's instruction set) so that it should
have uncharged memory once things went wrong. However that work isn't complete.
Error is handled only in __sk_migrate_filter() while memory can still leak in
the error path right after sk_chk_filter().

Fixes: bd4cf0ed33 ("net: filter: rework/optimize internal BPF interpreter's instruction set")
Signed-off-by: Leon Yu <chianglungyu@gmail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Tested-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-02 17:49:45 -07:00
Linus Torvalds
ca755175f2 Two md bugfixes for possible corruption when restarting reshape
If a raid5/6 reshape is restarted (After stopping and re-assembling
 the array) and the array is marked read-only (or read-auto), then
 the reshape will appear to complete immediately, without actually
 moving anything around.  This can result in corruption.
 
 There are two patches which do much the same thing in different places.
 They are separate because one is an older bug and so can be applied to
 more -stable kernels.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIUAwUAU40KTDnsnt1WYoG5AQJcaA/1GkoZit6LqLjiIQmsK9Ci/4TI+sNqYaSB
 9SleSjWt+bcNCRY4sS3Wv0H580LmkoRR24wdei+mukoFa+bpBBs6PodPMABAVsnL
 VxlnUX+P4Ef77s2zJ8B5wCY3ftmecaQL3TdZf10+hIITacXSp7JmsLJXm3DW+Jvq
 DZsxJRBEQfsz5obZAZXnvPAcTSkqMT4QQ13nIEmaYEz+AYVn6Tcf8xwDBOcZM4u9
 Gdi6BHNaY6RjSU1gsVblPYmWQyqqdgCJ6UEV/KYyY9rtFyozkvJ0SDWcu/kRA74A
 uydN5U6iVqJatY9l9eK2tV7GQkN+o+MWIA0JocTRZe67ihE4tWxiLRn/7fZdLVsX
 TV6zYar0M/ZSn3XioGi4hQ0tWDPpq/aCzCAk5JQpywgBmoaMqqh8rttwdCkWvK6P
 TNnaVfo3r9AMJY8MVm8in/efEhY6jUa3q2oDqCEKjuL916v9ODsxXloqTlbEy2KC
 NrKNLCZA2subbzPa3T8u4aKRBzl0xSBSig8ecrufSpDC1I0G+Mbuc8wrDzjAnI3N
 +fbQCxxRR0akcleZrFZD67avOa5/DsQqWJbcW1D5VCekJoZcgdz5CGJz/bNl+0i6
 bwrvNWi6q1X2P4Nt2BBhk771xzNiUlufsI0x7SFIJxpDiGlxINkluXvnEQKFSzhr
 uYSrvTCQwg==
 =cTEe
 -----END PGP SIGNATURE-----

Merge tag 'md/3.15-fixes' of git://neil.brown.name/md

Pull two md bugfixes from Neil Brown:
 "Two md bugfixes for possible corruption when restarting reshape

  If a raid5/6 reshape is restarted (After stopping and re-assembling
  the array) and the array is marked read-only (or read-auto), then the
  reshape will appear to complete immediately, without actually moving
  anything around.  This can result in corruption.

  There are two patches which do much the same thing in different
  places.  They are separate because one is an older bug and so can be
  applied to more -stable kernels"

* tag 'md/3.15-fixes' of git://neil.brown.name/md:
  md: always set MD_RECOVERY_INTR when interrupting a reshape thread.
  md: always set MD_RECOVERY_INTR when aborting a reshape or other "resync".
2014-06-02 17:04:37 -07:00
Jean Delvare
3aab01d800 net: ec_bhf: Add runtime dependencies
The ec_bhf driver is specific to the Beckhoff CX embedded PC series.
These are based on Intel x86 CPU. So we can add a dependency on
X86, with COMPILE_TEST as an alternative to still allow for broader
build-testing.

Signed-off-by: Jean Delvare <jdelvare@suse.de>
Cc: Darek Marcinkiewicz <reksio@newterm.pl>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-02 17:02:28 -07:00
Martin K. Petersen
3b8d2676d1 libata: Blacklist queued trim for Crucial M500
Queued trim only works for some users with MU05 firmware.  Revert to
blacklisting all firmware versions.

Introduced by commit d121f7d0cb ("libata: Update queued trim blacklist
for M5x0 drives") which this effectively reverts, while retaining the
blacklisting of M550.

See

    https://bugzilla.kernel.org/show_bug.cgi?id=71371

for reports of trouble with MU05 firmware.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-02 16:59:25 -07:00
Linus Torvalds
92b4e11315 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Peter Anvin:
 "A single quite small patch that managed to get overlooked earlier, to
  prevent a user space triggerable oops on systems without HPET"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
2014-06-02 16:57:23 -07:00
Linus Torvalds
8ee7a330fb USB fixes for 3.15-rc8
Here are some fixes for 3.15-rc8 that resolve a number of tiny USB
 issues that have been reported, and there are some new device ids as
 well.
 
 All have been tested in linux-next.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iEYEABECAAYFAlOM9RkACgkQMUfUDdst+ykhbACeKn9kmESO0QMC2ST0kQkQxHJX
 olYAoKbYZxvlZbdSJBmtYDbm1c5wrCfO
 =H5ae
 -----END PGP SIGNATURE-----

Merge tag 'usb-3.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
 "Here are some fixes for 3.15-rc8 that resolve a number of tiny USB
  issues that have been reported, and there are some new device ids as
  well.

  All have been tested in linux-next"

* tag 'usb-3.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  xhci: delete endpoints from bandwidth list before freeing whole device
  usb: pci-quirks: Prevent Sony VAIO t-series from switching usb ports
  USB: cdc-wdm: properly include types.h
  usb: cdc-wdm: export cdc-wdm uapi header
  USB: serial: option: add support for Novatel E371 PCIe card
  USB: ftdi_sio: add NovaTech OrionLXm product ID
  USB: io_ti: fix firmware download on big-endian machines (part 2)
  USB: Avoid runtime suspend loops for HCDs that can't handle suspend/resume
2014-06-02 16:56:42 -07:00
Linus Torvalds
da579dd6a1 Staging driver fixes for 3.15-rc8
Here are some staging driver fixes for 3.15.  3 are for the speakup
 drivers (one fix a regression caused in 3.15-rc, and the other 2 resolve
 a tty issue found by Ben Hutchings)  The comedi and r8192e_pci driver
 fixes also resolve reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iEYEABECAAYFAlOM9eUACgkQMUfUDdst+ykETACbBTSOZVE1pAIKGjGY2bcOtQET
 yKMAnil04hAbmz5L/5TNvdkUWu+7SfGi
 =K8cr
 -----END PGP SIGNATURE-----

Merge tag 'staging-3.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging driver fixes from Greg KH:
 "Here are some staging driver fixes for 3.15.

  Three are for the speakup drivers (one fixes a regression caused in
  3.15-rc, and the other two resolve a tty issue found by Ben Hutchings)
  The comedi and r8192e_pci driver fixes also resolve reported issues"

* tag 'staging-3.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: r8192e_pci: fix htons error
  Staging: speakup: Update __speakup_paste_selection() tty (ab)usage to match vt
  Staging: speakup: Move pasting into a work item
  staging: comedi: ni_daq_700: add mux settling delay
  speakup: fix incorrect perms on speakup_acntsa.c
2014-06-02 16:55:18 -07:00
Yuchung Cheng
0cfa5c07d6 tcp: fix cwnd undo on DSACK in F-RTO
This bug is discovered by an recent F-RTO issue on tcpm list
https://www.ietf.org/mail-archive/web/tcpm/current/msg08794.html

The bug is that currently F-RTO does not use DSACK to undo cwnd in
certain cases: upon receiving an ACK after the RTO retransmission in
F-RTO, and the ACK has DSACK indicating the retransmission is spurious,
the sender only calls tcp_try_undo_loss() if some never retransmisted
data is sacked (FLAG_ORIG_DATA_SACKED).

The correct behavior is to unconditionally call tcp_try_undo_loss so
the DSACK information is used properly to undo the cwnd reduction.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-02 16:50:49 -07:00
Eric W. Biederman
2d7a85f4b0 netlink: Only check file credentials for implicit destinations
It was possible to get a setuid root or setcap executable to write to
it's stdout or stderr (which has been set made a netlink socket) and
inadvertently reconfigure the networking stack.

To prevent this we check that both the creator of the socket and
the currentl applications has permission to reconfigure the network
stack.

Unfortunately this breaks Zebra which always uses sendto/sendmsg
and creates it's socket without any privileges.

To keep Zebra working don't bother checking if the creator of the
socket has privilege when a destination address is specified.  Instead
rely exclusively on the privileges of the sender of the socket.

Note from Andy: This is exactly Eric's code except for some comment
clarifications and formatting fixes.  Neither I nor, I think, anyone
else is thrilled with this approach, but I'm hesitant to wait on a
better fix since 3.15 is almost here.

Note to stable maintainers: This is a mess.  An earlier series of
patches in 3.15 fix a rather serious security issue (CVE-2014-0181),
but they did so in a way that breaks Zebra.  The offending series
includes:

    commit aa4cf9452f
    Author: Eric W. Biederman <ebiederm@xmission.com>
    Date:   Wed Apr 23 14:28:03 2014 -0700

        net: Add variants of capable for use on netlink messages

If a given kernel version is missing that series of fixes, it's
probably worth backporting it and this patch.  if that series is
present, then this fix is critical if you care about Zebra.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-02 16:34:09 -07:00
Kristian Evensen
22fd2a52f7 ipheth: Add support for iPad 2 and iPad 3
Each iPad model has a different product id, this patch adds support for iPad 2
(pid 0x12a2) and iPad 3 (pid 0x12a6). Note that iPad 2 must be jailbroken and a
third-party app must be used for tethering to work. On iPad 3, tethering works
out of the box (assuming your ISP is nice).

Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-02 16:11:55 -07:00
Jiri Pirko
9d0d68faea team: fix mtu setting
Now it is not possible to set mtu to team device which has a port
enslaved to it. The reason is that when team_change_mtu() calls
dev_set_mtu() for port device, notificator for NETDEV_PRECHANGEMTU
event is called and team_device_event() returns NOTIFY_BAD forbidding
the change. So fix this by returning NOTIFY_DONE here in case team is
changing mtu in team_change_mtu().

Introduced-by: 3d249d4c "net: introduce ethernet teaming device"
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-02 14:56:01 -07:00
Eric Dumazet
39c36094d7 net: fix inet_getid() and ipv6_select_ident() bugs
I noticed we were sending wrong IPv4 ID in TCP flows when MTU discovery
is disabled.
Note how GSO/TSO packets do not have monotonically incrementing ID.

06:37:41.575531 IP (id 14227, proto: TCP (6), length: 4396)
06:37:41.575534 IP (id 14272, proto: TCP (6), length: 65212)
06:37:41.575544 IP (id 14312, proto: TCP (6), length: 57972)
06:37:41.575678 IP (id 14317, proto: TCP (6), length: 7292)
06:37:41.575683 IP (id 14361, proto: TCP (6), length: 63764)

It appears I introduced this bug in linux-3.1.

inet_getid() must return the old value of peer->ip_id_count,
not the new one.

Lets revert this part, and remove the prevention of
a null identification field in IPv6 Fragment Extension Header,
which is dubious and not even done properly.

Fixes: 87c48fa3b4 ("ipv6: make fragment identifications less predictable")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-02 14:09:28 -07:00
Aleksander Morgado
fc0d6e9cd0 net: qmi_wwan: interface #11 in Sierra Wireless MC73xx is not QMI
This interface is unusable, as the cdc-wdm character device doesn't reply to
any QMI command. Also, the out-of-tree Sierra Wireless GobiNet driver fully
skips it.

Signed-off-by: Aleksander Morgado <aleksander@aleksander.es>
Acked-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-02 14:00:28 -07:00
Aleksander Morgado
9a793e71eb net: qmi_wwan: add additional Sierra Wireless QMI devices
A set of new VID/PIDs retrieved from the out-of-tree GobiNet/GobiSerial
Sierra Wireless drivers.

Signed-off-by: Aleksander Morgado <aleksander@aleksander.es>
Acked-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-02 14:00:28 -07:00
Toshiaki Makita
e0d7968ab6 bridge: Prevent insertion of FDB entry with disallowed vlan
br_handle_local_finish() is allowing us to insert an FDB entry with
disallowed vlan. For example, when port 1 and 2 are communicating in
vlan 10, and even if vlan 10 is disallowed on port 3, port 3 can
interfere with their communication by spoofed src mac address with
vlan id 10.

Note: Even if it is judged that a frame should not be learned, it should
not be dropped because it is destined for not forwarding layer but higher
layer. See IEEE 802.1Q-2011 8.13.10.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-02 13:38:23 -07:00
Michal Schmidt
bfc5184b69 netlink: rate-limit leftover bytes warning and print process name
Any process is able to send netlink messages with leftover bytes.
Make the warning rate-limited to prevent too much log spam.

The warning is supposed to help find userspace bugs, so print the
triggering command name to implicate the buggy program.

[v2: Use pr_warn_ratelimited instead of printk_ratelimited.]

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-02 11:16:11 -07:00
Jon Maxwell
c65c7a3066 bridge: notify user space after fdb update
There has been a number incidents recently where customers running KVM have
reported that VM hosts on different Hypervisors are unreachable. Based on
pcap traces we found that the bridge was broadcasting the ARP request out
onto the network. However some NICs have an inbuilt switch which on occasions
were broadcasting the VMs ARP request back through the physical NIC on the
Hypervisor. This resulted in the bridge changing ports and incorrectly learning
that the VMs mac address was external. As a result the ARP reply was directed
back onto the external network and VM never updated it's ARP cache. This patch
will notify the bridge command, after a fdb has been updated to identify such
port toggling.

Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-01 22:14:50 -07:00
Aleksander Morgado
4324be1e0b net: qmi_wwan: add Netgear AirCard 341U
Signed-off-by: Aleksander Morgado <aleksander@aleksander.es>
Acked-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-01 19:43:57 -07:00
Nikolay Aleksandrov
4b9b1cdf83 net: fix wrong mac_len calculation for vlans
After 1e785f48d2 ("net: Start with correct mac_len in
skb_network_protocol") skb->mac_len is used as a start of the
calculation in skb_network_protocol() but that is not always correct. If
skb->protocol == 8021Q/AD, usually the vlan header is already inserted
in the skb (i.e. vlan reorder hdr == 0). Usually when the packet enters
dev_hard_xmit it has mac_len == 0 so we take 2 bytes from the
destination mac address (skb->data + VLAN_HLEN) as a type in
skb_network_protocol() and return vlan_depth == 4. In the case where TSO is
off, then the mac_len is set but it's == 18 (ETH_HLEN + VLAN_HLEN), so
skb_network_protocol() returns a type from inside the packet and
offset == 22. Also make vlan_depth unsigned as suggested before.
As suggested by Eric Dumazet, move the while() loop in the if() so we
can avoid additional testing in fast path.

Here are few netperf tests + debug printk's to illustrate:
cat netperf.tso-on.reorder-on.bugged
- Vlan -> device (reorder on, default, this case is okay)
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.3.1 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    7111.54
[   81.605435] skb->len 65226 skb->gso_size 1448 skb->proto 0x800
skb->mac_len 0 vlan_depth 0 type 0x800

- Vlan -> device (reorder off, bad)
cat netperf.tso-on.reorder-off.bugged
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.3.1 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00     241.35
[  204.578332] skb->len 1518 skb->gso_size 0 skb->proto 0x8100
skb->mac_len 0 vlan_depth 4 type 0x5301
0x5301 are the last two bytes of the destination mac.

And if we stop TSO, we may get even the following:
[   83.343156] skb->len 2966 skb->gso_size 1448 skb->proto 0x8100
skb->mac_len 18 vlan_depth 22 type 0xb84
Because mac_len already accounts for VLAN_HLEN.

After the fix:
cat netperf.tso-on.reorder-off.fixed
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.3.1 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.01    5001.46
[   81.888489] skb->len 65230 skb->gso_size 1448 skb->proto 0x8100
skb->mac_len 0 vlan_depth 18 type 0x800

CC: Vlad Yasevich <vyasevic@redhat.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Daniel Borkman <dborkman@redhat.com>
CC: David S. Miller <davem@davemloft.net>

Fixes:1e785f48d29a ("net: Start with correct mac_len in
skb_network_protocol")
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-01 19:39:13 -07:00
Linus Torvalds
fad01e866a Linux 3.15-rc8 2014-06-01 19:12:24 -07:00
Linus Torvalds
204fe0380b Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc fix from Ben Herrenschmidt:
 "Here's just one trivial patch to wire up sys_renameat2 which I seem to
  have completely missed so far.

  (My test build scripts fwd me warnings but miss the ones generated for
  missing syscalls)"

* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
  powerpc: Wire renameat2() syscall
2014-06-01 18:30:07 -07:00
Linus Torvalds
568180a517 Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS fixes from Ralf Baechle:
 "A fair number of fixes across the field.  Nothing terribly
  complicated; the one liners in below changelog should be fairly
  descriptive.

  Noteworthy is the SB1 change which the result of changes to binutils
  resulting in one big gas warning for most files being assembled as
  well as the asid_cache and branch emulation fixes which fix corruption
  or possible uninteded behaviour of kernel or application code.  The
  remainder of fixes are more platforms or subsystem specific"

* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
  MIPS: R46000: Fix Micro-assembler field overflow for R4600 V2
  MIPS: ptrace: Avoid smp_processor_id() in preemptible code
  MIPS: Lemote 2F: cs5536: mfgpt: use raw locks
  MIPS: SB1: Fix excessive kernel warnings.
  MIPS: RC32434: fix broken PCI resource initialization
  MIPS: malta: memory.c: Initialize the 'memsize' variable
  MIPS: Fix typo when reporting cache and ftlb errors for ImgTec cores
  MIPS: Fix inconsistancy of __NR_Linux_syscalls value
  MIPS: Fix branch emulation of branch likely instructions.
  MIPS: Fix a typo error in AUDIT_ARCH definition
  MIPS: Change type of asid_cache to unsigned long
2014-06-01 18:28:58 -07:00
Linus Torvalds
32439700fe Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Various fixlets, mostly related to the (root-only) SCHED_DEADLINE
  policy, but also a hotplug bug fix and a fix for a NR_CPUS related
  overallocation bug causing a suspend/resume regression"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix hotplug vs. set_cpus_allowed_ptr()
  sched/cpupri: Replace NR_CPUS arrays
  sched/deadline: Replace NR_CPUS arrays
  sched/deadline: Restrict user params max value to 2^63 ns
  sched/deadline: Change sched_getparam() behaviour vs SCHED_DEADLINE
  sched: Disallow sched_attr::sched_policy < 0
  sched: Make sched_setattr() correctly return -EFBIG
2014-06-01 18:26:59 -07:00
Benjamin Herrenschmidt
8212f58a9b powerpc: Wire renameat2() syscall
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-06-02 09:24:27 +10:00
David S. Miller
6ce995c6f4 Included changes:
- prevent NULL dereference in multicast code
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJTiZ58AAoJEJgn97Bh2u9e2x4QAIjWLDIo6feo4jH8l6q6R5iO
 cH/EXCtqk9GHNwvfZDNt+pF19ejzVk/TPnmmXTZ4QcElS9GuXe5WdWxiGcS5KEwa
 0UNDRp8fgcBSV1Kqc/vbyKiQ4j69QtC1PPfLWUtxj/GYE0qHX/A1OzB9zROvoHJ7
 sa3l8O5XRWiaxBYDkT0RfhHH0jeDdvm3I9yt8B+4B6c71094VIsfGXBVPp4tPrdg
 nkuzBdwF0HFPiFrlsfboJDLcXPLpRR93H1GsmfELYd5jQ4rtUhlcuEESq6573tvB
 TV93tkm/zmbwtInMoPI29qKL8t2478cJH7SvKvM4NiqMsB1zOhknhUXzElh9TPGA
 xyNivxJraYJzL53XguBFO8A8fP1k/E8Z6UQXJbgry4lu+6qZ60e0/J8zGxGpSamP
 i1JX0MAVPX6T4MAlZ70LMxfmzJ5sSNkkYyXobG+aBa/AgzRsXVvG4So1qi364COx
 btCxgBXK1Z20ZuNclY8/J06D8EbTXI5y5MCSDvMCOHQlb5mjBl34RtFVw+5/QXkg
 v2suc7T/YLOPNtZktZC2506caPHoOlwEVvkyA55p+qdkcD/Dd5Iv4Hndi+g+C5gv
 O2ja7gUQco1R8ElormKW9rE7OvjiUlowNJmguXWAdzc9FC0yISpP66BAGjBqwhF9
 6YibEebXMQICjxAVTEAM
 =fvfu
 -----END PGP SIGNATURE-----

Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge

Included changes:
- prevent NULL dereference in multicast code

Antonion Quartulli says:

====================
pull request net: batman-adv 20140527

here you have another very small fix intended for net/linux-3.15.
It prevents some multicast functions from dereferencing a NULL pointer.
(Actually it was nothing more than a typo)
I hope it is not too late for such a small patch.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-31 20:01:47 -07:00
Linus Torvalds
a4bf79eb6a Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core futex/rtmutex fixes from Thomas Gleixner:
 "Three fixlets for long standing issues in the futex/rtmutex code
  unearthed by Dave Jones syscall fuzzer:

   - Add missing early deadlock detection checks in the futex code
   - Prevent user space from attaching a futex to kernel threads
   - Make the deadlock detector of rtmutex work again

  Looks large, but is more comments than code change"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rtmutex: Fix deadlock detector for real
  futex: Prevent attaching to kernel threads
  futex: Add another early deadlock detection check
2014-05-31 09:47:55 -07:00
Linus Torvalds
80e0679469 Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
 "Mostly quiet now:

  i915:
    fixing userspace visiblie issues, all stable marked

  radeon:
    one more pll fix, two crashers, one suspend/resume regression"

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
  drm/radeon: Resume fbcon last
  drm/radeon: only allocate necessary size for vm bo list
  drm/radeon: don't allow RADEON_GEM_DOMAIN_CPU for command submission
  drm/radeon: avoid crash if VM command submission isn't available
  drm/radeon: lower the ref * post PLL maximum once more
  drm/i915: Prevent negative relocation deltas from wrapping
  drm/i915: Only copy back the modified fields to userspace from execbuffer
  drm/i915: Fix dynamic allocation of physical handles
2014-05-31 09:19:02 -07:00
Linus Torvalds
9f12600fe4 dcache: add missing lockdep annotation
lock_parent() very much on purpose does nested locking of dentries, and
is careful to maintain the right order (lock parent first).  But because
it didn't annotate the nested locking order, lockdep thought it might be
a deadlock on d_lock, and complained.

Add the proper annotation for the inner locking of the child dentry to
make lockdep happy.

Introduced by commit 046b961b45 ("shrink_dentry_list(): take parent's
->d_lock earlier").

Reported-and-tested-by: Josh Boyer <jwboyer@fedoraproject.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-31 09:13:21 -07:00
Marek Lindner
af0a171c07 batman-adv: fix NULL pointer dereferences
Was introduced with 4c8755d69c
("batman-adv: Send multicast packets to nodes with a WANT_ALL flag")

Reported-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Acked-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2014-05-31 10:07:14 +02:00
David S. Miller
dbfc4b698a Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
The following patchset contains a late fix for IPVS:

* Fix crash when trying to remove the transport header with non-linear
  skbuffs, this was introduced in 3.6-rc. Patch from Peter Christensen
  via the IPVS folks.

I'll pass this to -stable once this hits mainstream.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-30 17:56:09 -07:00
David S. Miller
554215c5ca linux-can-fixes-for-3.15-20140528
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iEYEABECAAYFAlOFkZ4ACgkQjTAFq1RaXHOYCQCeJwwa8XQMw2HkGemGMBSpcUHb
 MLwAoISCoqgrXWq/lBBzDqlVXik8fR4t
 =gHVg
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-3.15-20140528' of git://gitorious.org/linux-can/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2014-05-28

here's a pull request for v3.15, hope it's not too late.

Oliver Hartkopp fixed a bug in the CAN led trigger device renaming code.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-30 17:32:53 -07:00
Jack Morgenstein
111c6094bd net/mlx4_core: Reset RoCE VF gids when guest driver goes down
Reset the GIDs assigned to a VF in the port RoCE GID table when
that guest goes down (either crashes or goes down cleanly).

As part of this fix, we refactor the RoCE gid table driver copy,
moving it to the mlx4_port_info structure (together with the MAC
and VLAN tables).

As with the MAC and VLAN tables, we now use a mutex per port
for the GID table so that modifying the driver copy and
modifying the firmware copy of a port GID table becomes an
atomic operation (thus avoiding driver-copy/FW-copy mismatches).

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-30 16:57:52 -07:00
Ivan Mikhaylov
09271db6e0 emac: aggregation of v1-2 PLB errors for IER register
Aggreagation of version 1-2 because of version 1 can hit
PLB errors too. If it's not set so we missing events for PLB bits
and driver can't process those interrupts.

Signed-off-by: Ivan Mikhaylov <ivan@ru.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-30 16:29:57 -07:00
Ivan Mikhaylov
faacd3af0c emac: add missing support of 10mbit in emac/rgmii
In chips of emac/rgmii b'000' for 0/1 channel isn't suitable which
resulted in non working network interface in this mode.

Signed-off-by: Ivan Mikhaylov <ivan@ru.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-30 16:29:57 -07:00
Daniel Vetter
18ee37a485 drm/radeon: Resume fbcon last
So a few people complained that

commit 177cf92de4
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Apr 1 22:14:59 2014 +0200

    drm/crtc-helpers: fix dpms on logic

which was merged into 3.15-rc1, broke resume on radeons. Strangely git
bisect lead everyone to

commit 25f397a429
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Fri Jul 19 18:57:11 2013 +0200

    drm/crtc-helper: explicit DPMS on after modeset

which was merged long ago and actually part of 3.14.

Digging deeper I've noticed (again) that the call to
drm_helper_resume_force_mode in the radeon resume handlers was a no-op
previously because everything gets shut down on suspend. radeon does
this with explicit calls to drm_helper_connector_dpms with DPMS_OFF.
But with 177c we now force the dpms state to ON, so suddenly
resume_force_mode actually forced the crtcs back on.

This is the intention of the change after all, the problem is that
radeon resumes the fbdev console layer _before_ restoring the display,
through calling fb_set_suspend. And fbcon does an immediate ->set_par,
which in turn causes the same forced mode restore to happen.

Two concurrent modeset operations didn't lead to happiness. Fix this
by delaying the fbcon resume until the end of the readeon resum
functions.

v2: Fix up a bit of the spelling fail.

References: https://lkml.org/lkml/2014/5/29/1043
References: https://lkml.org/lkml/2014/5/2/388
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=74751
Tested-by: Ken Moffat <zarniwhoop@ntlworld.com>
Cc: Alex Deucher <alexdeucher@gmail.com>
Cc: Ken Moffat <zarniwhoop@ntlworld.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Dave Airlie <airlied@gmail.com>
2014-05-31 09:19:51 +10:00
Dave Airlie
1446e04c9b Merge branch 'drm-fixes-3.15' of git://people.freedesktop.org/~deathsimple/linux into drm-fixes
this is the next pull request for stashed up radeon fixes for 3.15. This is finally calming down with only four patches in this pull request.

* 'drm-fixes-3.15' of git://people.freedesktop.org/~deathsimple/linux:
  drm/radeon: only allocate necessary size for vm bo list
  drm/radeon: don't allow RADEON_GEM_DOMAIN_CPU for command submission
  drm/radeon: avoid crash if VM command submission isn't available
  drm/radeon: lower the ref * post PLL maximum once more
2014-05-31 09:19:05 +10:00
Linus Torvalds
1487385edb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input subsystem fixes from Dmitry Torokhov:
 "A couple of driver/build fixups and also redone quirk for Synaptics
  touchpads on Lenovo boxes (now using PNP IDs instead of DMI data to
  limit number of quirks)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: synaptics - change min/max quirk table to pnp-id matching
  Input: synaptics - add a matches_pnp_id helper function
  Input: synaptics - T540p - unify with other LEN0034 models
  Input: synaptics - add min/max quirk for the ThinkPad W540
  Input: ambakmi - request a shared interrupt for AMBA KMI devices
  Input: pxa27x-keypad - fix generating scancode
  Input: atmel-wm97xx - only build for AVR32
  Input: fix ps2/serio module dependency
2014-05-30 12:07:48 -07:00
Linus Torvalds
1326af2464 Regression fix for the IEEE 1394 subsystem:
Re-enable IRQ-based asynchronous request reception at addresses below 128 TB.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJTiIlwAAoJEHnzb7JUXXnQ+4oQAJ3ZHtqDJNP99qokiyVorDSw
 hk2b/nCF2A7PYGgYdyImDx+xiOdI7FE3T6vpDBrel/Vfc/eny3g1PSmklF0pmrgR
 LKyEAKmqRrEFKNevJPIAOyvf8jMV4E7mkKKp2iGN5s77CdW/KeoNfeP8yzRzMW3n
 ChG9yZIPWMVuoiGBL74iOZujETP3Y62CVRP9Anz3uG2I7oGQN+09uK+Qc58SR917
 fbvJh8f2Yrf+st7oogwpFvKRlDqyskmoonWc39OCPcGeumQ9LycTs5KTMx4htVjW
 jf8bJDQ/xmHCvPpZpQtZiMbj5pmMkD9cSt0KvMHv4TuV/1NBFECITjoQ03//+byb
 RmxTH8GRryfpafTLL1Qlf+boecjEK9yuJ/k1YbDEt9YTc6x29oTqJjeXbli5r3FX
 f0MYLpDEdDH4GuKMgdvDt+CFzGI7NHWupjyYbm/uT0LCEvLhmKnh3z+3HhmUO2wD
 7PoDp63t779YhiScnfA+P5CXmjnMjZNtqeCyJYiCDrzcCS3VVQVocFf3iEKF8xrf
 gOju+BZHuNzw5S3Tn0kf3LaexUW9eMD1LAYDooF1eqpl/hrmsYpLhw3mI7601CDa
 UNHTY2Wz6Y/2PwkpzKgsHrrKnyaQWak/kZl0jFy7lZhvJqFyNPe8nVWYRdPdVeHk
 /b/TKJZDiRQXZCrcfEXc
 =GDAT
 -----END PGP SIGNATURE-----

Merge tag 'firewire-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394

Pull firewire fix from Stefan Richter:
 "A regression fix for the IEEE 1394 subsystem: re-enable IRQ-based
  asynchronous request reception at addresses below 128 TB"

* tag 'firewire-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
  firewire: revert to 4 GB RDMA, fix protocols using Memory Space
2014-05-30 12:06:15 -07:00
Linus Torvalds
24e19d279f A dm-cache stable fix to split discards on cache block boundaries
because dm-cache cannot yet handle discards that span cache blocks.
 
 Really fix a dm-mpath LOCKDEP warning that was introduced in -rc1.
 
 Add a 'no_space_timeout' control to dm-thinp to restore the ability to
 queue IO indefinitely when no data space is available.  This fixes a
 change in behavior that was introduced in -rc6 where the timeout
 couldn't be disabled.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJTh9QAAAoJEMUj8QotnQNaNpYH/j07FeH8YlxXRcFzDi7xRVtx
 luK5b9fLLlmPwW2eKSrvpI8Le4jwDvLwBmpEvN9/wyPiRDSUnYIyYdoV7RJXX2LT
 wqXatObb84fwQBJ6/q8o2YMzU5ODa5XT6KGEZyD4cHdAZ9FZSwfgqhslyrBJDkSN
 JBFfkXu066qw8cuYA6KFv4DwBf5eHAt5AjV/QPGd5zGXwETHLZ4ypgpwYHAGbdXa
 MgfHetwtEnJYvVQex/e+9xC5IDc4/BEAhZq4n3YmEJjNq8EbX15udHmCX7S2M5pT
 +9tNjUMz4j9BhoC9F8ntRz0pxWZtJK9hGojO4xoXqOCOHgp1xLQd/tHrFZS0v8E=
 =u5Xd
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.15-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device-mapper fixes from Mike Snitzer:
 "A dm-cache stable fix to split discards on cache block boundaries
  because dm-cache cannot yet handle discards that span cache blocks.

  Really fix a dm-mpath LOCKDEP warning that was introduced in -rc1.

  Add a 'no_space_timeout' control to dm-thinp to restore the ability to
  queue IO indefinitely when no data space is available.  This fixes a
  change in behavior that was introduced in -rc6 where the timeout
  couldn't be disabled"

* tag 'dm-3.15-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm mpath: really fix lockdep warning
  dm cache: always split discards on cache block boundaries
  dm thin: add 'no_space_timeout' dm-thin-pool module param
2014-05-30 12:04:56 -07:00
Minchan Kim
6538b8ea88 x86_64: expand kernel stack to 16K
While I play inhouse patches with much memory pressure on qemu-kvm,
3.14 kernel was randomly crashed. The reason was kernel stack overflow.

When I investigated the problem, the callstack was a little bit deeper
by involve with reclaim functions but not direct reclaim path.

I tried to diet stack size of some functions related with alloc/reclaim
so did a hundred of byte but overflow was't disappeard so that I encounter
overflow by another deeper callstack on reclaim/allocator path.

Of course, we might sweep every sites we have found for reducing
stack usage but I'm not sure how long it saves the world(surely,
lots of developer start to add nice features which will use stack
agains) and if we consider another more complex feature in I/O layer
and/or reclaim path, it might be better to increase stack size(
meanwhile, stack usage on 64bit machine was doubled compared to 32bit
while it have sticked to 8K. Hmm, it's not a fair to me and arm64
already expaned to 16K. )

So, my stupid idea is just let's expand stack size and keep an eye
toward stack consumption on each kernel functions via stacktrace of ftrace.
For example, we can have a bar like that each funcion shouldn't exceed 200K
and emit the warning when some function consumes more in runtime.
Of course, it could make false positive but at least, it could make a
chance to think over it.

I guess this topic was discussed several time so there might be
strong reason not to increase kernel stack size on x86_64, for me not
knowing so Ccing x86_64 maintainers, other MM guys and virtio
maintainers.

Here's an example call trace using up the kernel stack:

         Depth    Size   Location    (51 entries)
         -----    ----   --------
   0)     7696      16   lookup_address
   1)     7680      16   _lookup_address_cpa.isra.3
   2)     7664      24   __change_page_attr_set_clr
   3)     7640     392   kernel_map_pages
   4)     7248     256   get_page_from_freelist
   5)     6992     352   __alloc_pages_nodemask
   6)     6640       8   alloc_pages_current
   7)     6632     168   new_slab
   8)     6464       8   __slab_alloc
   9)     6456      80   __kmalloc
  10)     6376     376   vring_add_indirect
  11)     6000     144   virtqueue_add_sgs
  12)     5856     288   __virtblk_add_req
  13)     5568      96   virtio_queue_rq
  14)     5472     128   __blk_mq_run_hw_queue
  15)     5344      16   blk_mq_run_hw_queue
  16)     5328      96   blk_mq_insert_requests
  17)     5232     112   blk_mq_flush_plug_list
  18)     5120     112   blk_flush_plug_list
  19)     5008      64   io_schedule_timeout
  20)     4944     128   mempool_alloc
  21)     4816      96   bio_alloc_bioset
  22)     4720      48   get_swap_bio
  23)     4672     160   __swap_writepage
  24)     4512      32   swap_writepage
  25)     4480     320   shrink_page_list
  26)     4160     208   shrink_inactive_list
  27)     3952     304   shrink_lruvec
  28)     3648      80   shrink_zone
  29)     3568     128   do_try_to_free_pages
  30)     3440     208   try_to_free_pages
  31)     3232     352   __alloc_pages_nodemask
  32)     2880       8   alloc_pages_current
  33)     2872     200   __page_cache_alloc
  34)     2672      80   find_or_create_page
  35)     2592      80   ext4_mb_load_buddy
  36)     2512     176   ext4_mb_regular_allocator
  37)     2336     128   ext4_mb_new_blocks
  38)     2208     256   ext4_ext_map_blocks
  39)     1952     160   ext4_map_blocks
  40)     1792     384   ext4_writepages
  41)     1408      16   do_writepages
  42)     1392      96   __writeback_single_inode
  43)     1296     176   writeback_sb_inodes
  44)     1120      80   __writeback_inodes_wb
  45)     1040     160   wb_writeback
  46)      880     208   bdi_writeback_workfn
  47)      672     144   process_one_work
  48)      528     112   worker_thread
  49)      416     240   kthread
  50)      176     176   ret_from_fork

[ Note: the problem is exacerbated by certain gcc versions that seem to
  generate much bigger stack frames due to apparently bad coalescing of
  temporaries and generating too many spills.  Rusty saw gcc-4.6.4 using
  35% more stack on the virtio path than 4.8.2 does, for example.

  Minchan not only uses such a bad gcc version (4.6.3 in his case), but
  some of the stack use is due to debugging (CONFIG_DEBUG_PAGEALLOC is
  what causes that kernel_map_pages() frame, for example). But we're
  clearly getting too close.

  The VM code also seems to have excessive stack frames partly for the
  same compiler reason, triggered by excessive inlining and lots of
  function arguments.

  We need to improve on our stack use, but in the meantime let's do this
  simple stack increase too.  Unlike most earlier reports, there is
  nothing simple that stands out as being really horribly wrong here,
  apart from the fact that the stack frames are just bigger than they
  should need to be.        - Linus ]

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S Tsirkin <mst@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: PJ Waskiewicz <pjwaskiewicz@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-30 11:52:51 -07:00
Linus Torvalds
6f6111e4a7 Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs dcache livelock fix from Al Viro:
 "Fixes for livelocks in shrink_dentry_list() introduced by fixes to
  shrink list corruption; the root cause was that trylock of parent's
  ->d_lock could be disrupted by d_walk() happening on other CPUs,
  resulting in shrink_dentry_list() making no progress *and* the same
  d_walk() being called again and again for as long as
  shrink_dentry_list() doesn't get past that mess.

  The solution is to have shrink_dentry_list() treat that trylock
  failure not as 'try to do the same thing again', but 'lock them in the
  right order'"

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  dentry_kill() doesn't need the second argument now
  dealing with the rest of shrink_dentry_list() livelock
  shrink_dentry_list(): take parent's ->d_lock earlier
  expand dentry_kill(dentry, 0) in shrink_dentry_list()
  split dentry_kill()
  lift the "already marked killed" case into shrink_dentry_list()
2014-05-30 09:52:55 -07:00