linux/drivers
Trigger Huang 85744e9c10 drm/scheduler: Fix bad job be re-processed in TDR
A bad job is the one triggered TDR(In the current amdgpu's
implementation, actually all the jobs in the current joq-queue will
be treated as bad jobs). In the recovery process, its fence
will be fake signaled and as a result, the work behind will be scheduled
to delete it from the mirror list, but if the TDR process is invoked
before the work's execution, then this bad job might be processed again
and the call dma_fence_set_error to its fence in TDR process will lead to
kernel warning trace:

[  143.033605] WARNING: CPU: 2 PID: 53 at ./include/linux/dma-fence.h:437 amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched]
kernel: [  143.033606] Modules linked in: amdgpu(OE) amdchash(OE) amdttm(OE) amd_sched(OE) amdkcl(OE) amd_iommu_v2 drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 snd_hda_codec_generic crypto_simd glue_helper cryptd snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq joydev snd_seq_device snd_timer snd soundcore binfmt_misc input_leds mac_hid serio_raw nfsd auth_rpcgss nfs_acl lockd grace sunrpc sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 8139too floppy psmouse 8139cp mii i2c_piix4 pata_acpi
[  143.033649] CPU: 2 PID: 53 Comm: kworker/2:1 Tainted: G           OE    4.15.0-20-generic #21-Ubuntu
[  143.033650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  143.033653] Workqueue: events drm_sched_job_timedout [amd_sched]
[  143.033656] RIP: 0010:amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched]
[  143.033657] RSP: 0018:ffffa9f880fe7d48 EFLAGS: 00010202
[  143.033659] RAX: 0000000000000007 RBX: ffff9b98f2b24c00 RCX: ffff9b98efef4f08
[  143.033660] RDX: ffff9b98f2b27400 RSI: ffff9b98f2b24c50 RDI: ffff9b98efef4f18
[  143.033660] RBP: ffffa9f880fe7d98 R08: 0000000000000001 R09: 00000000000002b6
[  143.033661] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9b98efef3430
[  143.033662] R13: ffff9b98efef4d80 R14: ffff9b98efef4e98 R15: ffff9b98eaf91c00
[  143.033663] FS:  0000000000000000(0000) GS:ffff9b98ffd00000(0000) knlGS:0000000000000000
[  143.033664] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  143.033665] CR2: 00007fc49c96d470 CR3: 000000001400a005 CR4: 00000000003606e0
[  143.033669] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  143.033669] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  143.033670] Call Trace:
[  143.033744]  amdgpu_device_gpu_recover+0x144/0x820 [amdgpu]
[  143.033788]  amdgpu_job_timedout+0x9b/0xa0 [amdgpu]
[  143.033791]  drm_sched_job_timedout+0xcc/0x150 [amd_sched]
[  143.033795]  process_one_work+0x1de/0x410
[  143.033797]  worker_thread+0x32/0x410
[  143.033799]  kthread+0x121/0x140
[  143.033801]  ? process_one_work+0x410/0x410
[  143.033803]  ? kthread_create_worker_on_cpu+0x70/0x70
[  143.033806]  ret_from_fork+0x35/0x40

So just delete the bad job from mirror list directly

Changes in v3:
	- Add a helper function to delete the bad jobs from mirror list and call
		it directly *before* the job's fence is signaled

Changes in v2:
	- delete the useless list node check
	- also delete bad jobs in drm_sched_main because:
		kthread_unpark(ring->sched.thread) will be invoked very early before
		amdgpu_device_gpu_recover's return, then drm_sched_main will have
		chance to pick up a new job from the job queue. This new job will be
		added into the mirror list and processed by amdgpu_job_run, but may
		not be deleted from the mirror list on time due to the same reason.
		And finally re-processed by drm_sched_job_recovery

Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
Reviewed-by: Christian König <chrstian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-11-19 16:38:15 -05:00
..
accessibility
acpi libnvdimm 4.20-rc3 2018-11-18 12:21:09 -08:00
amba
android
ata libata: blacklist SAMSUNG MZ7TD256HAFV-000L9 SSD 2018-11-12 12:44:19 -07:00
atm atm: zatm: Fix empty body Clang warnings 2018-10-18 15:39:10 -07:00
auxdisplay The Compiler Attributes series 2018-11-01 18:34:46 -07:00
base mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock 2018-10-31 08:54:17 -07:00
bcma
block for-linus-20181115 2018-11-16 09:31:59 -06:00
bluetooth Merge branch 'work.tty-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-10-24 14:43:41 +01:00
bus ARM: SoC driver updates for 4.17 2018-10-29 15:16:01 -07:00
cdrom gdrom: fix mistake in assignment of error 2018-10-25 11:17:40 -06:00
char RTC for 4.20 2018-10-27 09:24:24 -07:00
clk clk: qcom: gcc: Fix board clock node name 2018-11-09 14:13:55 -08:00
clocksource Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-11-11 16:41:50 -06:00
connector
cpufreq cpufreq: imx6q: add return value check for voltage scale 2018-11-07 13:31:07 +01:00
cpuidle ARM: cpuidle: Convert to use cpuidle_register|unregister() 2018-11-08 18:53:00 +01:00
crypto crypto: hisilicon - Fix reference after free of memories on error path 2018-11-09 17:35:43 +08:00
dax
dca
devfreq
dio
dma pci-v4.20-changes 2018-10-25 06:50:48 -07:00
dma-buf dma-buf: Update reservation shared_count after adding the new fence 2018-10-26 15:42:11 +01:00
edac * skx_edac: Address translation for NVDIMMs (Tony Luck and Qiuxu Zhuo) 2018-11-02 11:17:22 -07:00
eisa
extcon
firewire
firmware efi: Permit calling efi_mem_reserve_persistent() from atomic context 2018-11-15 10:04:47 +01:00
fmc
fpga fpga: add devm_fpga_region_create 2018-10-16 11:13:50 +02:00
fsi iov_iter: Separate type from direction and use accessor functions 2018-10-24 00:41:07 +01:00
gnss
gpio pci-v4.20-changes 2018-10-25 06:50:48 -07:00
gpu drm/scheduler: Fix bad job be re-processed in TDR 2018-11-19 16:38:15 -05:00
hid HID: asus: fix build warning wiht CONFIG_ASUS_WMI disabled 2018-11-06 13:57:42 +01:00
hsi
hv hv_balloon: Replace spin_is_locked() with lockdep 2018-10-15 20:54:17 +02:00
hwmon hwmon: (ibmpowernv) Remove bogus __init annotations 2018-11-04 15:55:12 -08:00
hwspinlock
hwtracing stm class: Use memcat_p() 2018-10-11 12:12:55 +02:00
i2c i2c: nvidia-gpu: make pm_ops static 2018-11-09 17:56:44 +01:00
ide
idle Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-10-23 13:32:18 +01:00
iio Staging/IIO patches for 4.20-rc1 2018-10-29 10:38:10 -07:00
infiniband Revert "mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks" 2018-10-26 16:25:19 -07:00
input Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax 2018-10-28 11:35:40 -07:00
iommu mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
ipack
irqchip irqchip/irq-mvebu-sei: Fix a NULL vs IS_ERR() bug in probe function 2018-11-01 12:38:48 +01:00
isdn Merge branch 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-11-01 19:58:52 -07:00
leds LED fixes for 4.20-rc2 2018-11-08 17:49:04 -06:00
lightnvm lightnvm: pblk: guarantee that backpointer is respected on writer stall 2018-10-09 08:25:08 -06:00
macintosh memblock: stop using implicit alignment to SMP_CACHE_BYTES 2018-10-31 08:54:16 -07:00
mailbox - Convert print users to use the %pOFn format specifier 2018-10-29 10:30:44 -07:00
mcb
md for-linus-20181102 2018-11-02 11:25:48 -07:00
media drm-misc-next for v4.21, part 1: 2018-11-19 10:40:33 +10:00
memory
memstick
message
mfd chrome-platform for v4.20 2018-10-31 16:47:55 -07:00
misc Merge branch 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-11-01 19:58:52 -07:00
mmc Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-10-23 13:32:18 +01:00
mtd mtd: sa1100: avoid VLA in sa1100_setup_mtd 2018-11-06 10:38:17 +01:00
mux This is the bulk of GPIO changes for the v4.20 series: 2018-10-23 08:45:05 +01:00
net net: dsa: mv88e6xxx: Fix clearing of stats counters 2018-11-11 10:19:10 -08:00
nfc NFC: nfcmrvl_uart: fix OF child-node lookup 2018-10-23 13:28:53 -05:00
ntb ntb: idt: Alter the driver info comments 2018-11-01 10:33:12 -04:00
nubus
nvdimm libnvdimm for 4.20 2018-10-25 06:31:56 -07:00
nvme for-linus-20181109 2018-11-09 16:31:51 -06:00
nvmem nvmem: hide unused nvmem_find_cell_by_index function 2018-10-15 15:56:15 +02:00
of Devicetree fixes for 4.20-rc: 2018-11-09 16:41:58 -06:00
opp
oprofile
parisc parisc: Add alternative coding infrastructure 2018-10-17 17:22:26 +02:00
parport
pci Revert "ACPI/PCI: Pay attention to device-specific _PXM node values" 2018-11-13 08:38:17 -06:00
pcmcia powerpc updates for 4.20 2018-10-26 14:36:21 -07:00
perf arm64 updates for 4.20: 2018-10-22 17:30:06 +01:00
phy USB/PHY patches for 4.20-rc1 2018-10-26 08:14:13 -07:00
pinctrl pinctrl: meson: fix meson8b ao pull register bits 2018-11-05 09:33:22 +01:00
platform platform-drivers-x86 for v4.20-1 2018-11-01 08:42:21 -07:00
pnp
power Devicetree updates for 4.20: 2018-10-26 12:09:58 -07:00
powercap Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-10-23 13:32:18 +01:00
pps
ps3
ptp ptp: drop redundant kasprintf() to create worker name 2018-10-28 19:20:06 -07:00
pwm pwm: lpss: Only set update bit if we are actually changing the settings 2018-10-16 13:16:15 +02:00
rapidio
ras
regulator regulator: Regulator updates for next release 2018-10-23 01:54:44 +01:00
remoteproc remoteproc: qcom: q6v5-mss: Register segments/dumpfn for coredump 2018-10-19 12:54:03 -07:00
reset ARM: SoC driver updates for 4.17 2018-10-29 15:16:01 -07:00
rpmsg
rtc rtc: pcf2127: fix a kmemleak caused in pcf2127_i2c_gather_write 2018-11-07 17:13:56 +01:00
s390 s390/qeth: report 25Gbit link speed 2018-11-03 10:44:06 -07:00
sbus oradax: remove redundant null check before kfree 2018-10-07 22:42:00 -07:00
scsi for-linus-20181115 2018-11-16 09:31:59 -06:00
sfi mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
sh
siox
slimbus
sn
soc soc: ti: QMSS: Fix usage of irq_set_affinity_hint 2018-11-02 11:22:09 -07:00
soundwire
spi - New Drivers 2018-10-25 06:19:15 -07:00
spmi
ssb ssb: chipcommon: fix fall-through annotation 2018-10-05 11:37:20 +03:00
staging drm/ttm: initialize globals during device init (v2) 2018-11-05 14:21:21 -05:00
target scsi: target/core: Avoid that a kernel oops is triggered when COMPARE AND WRITE fails 2018-11-05 22:16:00 -05:00
tc TC: Set DMA masks for devices 2018-10-11 09:16:44 -07:00
tee
thermal Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux 2018-10-31 11:28:12 -07:00
thunderbolt Merge 4.19-rc7 into char-misc-next 2018-10-08 15:33:21 +02:00
tty TTY/Serial fixes for 4.20-rc2 2018-11-10 13:32:14 -06:00
uio
usb usb: typec: ucsi: add support for Cypress CCGx 2018-11-09 18:49:59 +01:00
uwb
vfio VFIO updates for v4.20 2018-10-31 11:01:38 -07:00
vhost virtio, vhost: fixes, tweaks 2018-11-01 14:42:49 -07:00
video drm-misc-next for v4.21, part 1: 2018-11-19 10:40:33 +10:00
virt
virtio virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON 2018-10-24 20:57:55 -04:00
visorbus
vlynq
vme
w1 w1: IAD Register is yet readable trough iad sys file. Fix snprintf (%u for unsigned, count for max size). 2018-10-15 20:50:32 +02:00
watchdog watchdog: ts4800: release syscon device node in ts4800_wdt_probe() 2018-10-22 10:16:28 +02:00
xen xen: fixes for 4.20-rc2 2018-11-10 08:58:48 -06:00
zorro
Kconfig
Makefile