bianbu-linux-6.6

mirror of https://gitee.com/bianbu-linux/linux-6.6 synced 2025-07-24 01:54:03 -04:00

Author	SHA1	Message	Date
Christian König	cbfae36cea	drm/amdgpu: cleanup PTE flag generation v3 Move the ASIC specific code into a new callback function. v2: mask the flags for SI and CIK instead of a BUG_ON(). v3: remove last missed BUG_ON(). Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-16 09:59:29 -05:00
Christian König	71776b6dae	drm/amdgpu: cleanup mtype mapping Unify how we map the UAPI flags to the PTE hardware flags for a mapping. Only the MTYPE is actually ASIC dependent, all other flags should be copied over 1 to 1 and ASIC differences are handled later on. Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-16 09:59:21 -05:00
Tianci.Yin	1dd077bbba	drm/amdgpu: add navi14 PCI ID for work station SKU Add the navi14 PCI device id. Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Tianci.Yin <tianci.yin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-16 09:59:14 -05:00
Philip Yang	cde85ac247	drm/amdgpu: check if nbio->ras_if exist To avoid NULL function pointer access. This happens on VG10, reboot command hangs and have to power off/on to reboot the machine. This is serial console log: [ OK ] Reached target Unmount All Filesystems. [ OK ] Reached target Final Step. Starting Reboot... [ 305.696271] systemd-shutdown[1]: Syncing filesystems and block devices. [ 306.947328] systemd-shutdown[1]: Sending SIGTERM to remaining processes... [ 306.963920] systemd-journald[1722]: Received SIGTERM from PID 1 (systemd-shutdow). [ 307.322717] systemd-shutdown[1]: Sending SIGKILL to remaining processes... [ 307.336472] systemd-shutdown[1]: Unmounting file systems. [ 307.454202] EXT4-fs (sda2): re-mounted. Opts: errors=remount-ro [ 307.480523] systemd-shutdown[1]: All filesystems unmounted. [ 307.486537] systemd-shutdown[1]: Deactivating swaps. [ 307.491962] systemd-shutdown[1]: All swaps deactivated. [ 307.497624] systemd-shutdown[1]: Detaching loop devices. [ 307.504418] systemd-shutdown[1]: All loop devices detached. [ 307.510418] systemd-shutdown[1]: Detaching DM devices. [ 307.565907] sd 2:0:0:0: [sda] Synchronizing SCSI cache [ 307.731313] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 307.738802] #PF: supervisor read access in kernel mode [ 307.744326] #PF: error_code(0x0000) - not-present page [ 307.749850] PGD 0 P4D 0 [ 307.752568] Oops: 0000 [#1] SMP PTI [ 307.756314] CPU: 3 PID: 1 Comm: systemd-shutdow Not tainted 5.2.0-rc1-kfd-yangp #453 [ 307.764644] Hardware name: ASUS All Series/Z97-PRO(Wi-Fi ac)/USB 3.1, BIOS 9001 03/07/2016 [ 307.773580] RIP: 0010:soc15_common_hw_fini+0x33/0xc0 [amdgpu] [ 307.779760] Code: 89 fb e8 60 f5 ff ff f6 83 50 df 01 00 04 75 3d 48 8b b3 90 7d 00 00 48 c7 c7 17 b8 530 [ 307.799967] RSP: 0018:ffffac9483153d40 EFLAGS: 00010286 [ 307.805585] RAX: 0000000000000000 RBX: ffff9eb299da0000 RCX: 0000000000000006 [ 307.813261] RDX: 0000000000000000 RSI: ffff9eb29e3508a0 RDI: ffff9eb29e350000 [ 307.820935] RBP: ffff9eb299da0000 R08: 0000000000000000 R09: 0000000000000000 [ 307.828609] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9eb299dbd1f8 [ 307.836284] R13: ffffffffc04f8368 R14: ffff9eb29cebd130 R15: 0000000000000000 [ 307.843959] FS: 00007f06721c9940(0000) GS:ffff9eb2a18c0000(0000) knlGS:0000000000000000 [ 307.852663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 307.858842] CR2: 0000000000000000 CR3: 000000081d798005 CR4: 00000000001606e0 [ 307.866516] Call Trace: [ 307.869169] amdgpu_device_ip_suspend_phase2+0x80/0x110 [amdgpu] [ 307.875654] ? amdgpu_device_ip_suspend_phase1+0x4d/0xd0 [amdgpu] [ 307.882230] amdgpu_device_ip_suspend+0x2e/0x60 [amdgpu] [ 307.887966] amdgpu_pci_shutdown+0x2f/0x40 [amdgpu] [ 307.893211] pci_device_shutdown+0x31/0x60 [ 307.897613] device_shutdown+0x14c/0x1f0 [ 307.901829] kernel_restart+0xe/0x50 [ 307.905669] __do_sys_reboot+0x1df/0x210 [ 307.909884] ? task_work_run+0x73/0xb0 [ 307.913914] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 307.918970] do_syscall_64+0x4a/0x1c0 [ 307.922904] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 307.928336] RIP: 0033:0x7f0671cf8373 [ 307.932176] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 89 fa be 69 19 128 [ 307.952384] RSP: 002b:00007ffdd1723d68 EFLAGS: 00000202 ORIG_RAX: 00000000000000a9 [ 307.960527] RAX: ffffffffffffffda RBX: 0000000001234567 RCX: 00007f0671cf8373 [ 307.968201] RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead [ 307.975875] RBP: 00007ffdd1723dd0 R08: 0000000000000000 R09: 0000000000000000 [ 307.983550] R10: 0000000000000002 R11: 0000000000000202 R12: 00007ffdd1723dd8 [ 307.991224] R13: 0000000000000000 R14: 0000001b00000004 R15: 00007ffdd17240c8 [ 307.998901] Modules linked in: xt_MASQUERADE nfnetlink iptable_nat xt_addrtype xt_conntrack nf_nat nf_cos [ 308.026505] CR2: 0000000000000000 [ 308.039998] RIP: 0010:soc15_common_hw_fini+0x33/0xc0 [amdgpu] [ 308.046180] Code: 89 fb e8 60 f5 ff ff f6 83 50 df 01 00 04 75 3d 48 8b b3 90 7d 00 00 48 c7 c7 17 b8 530 [ 308.066392] RSP: 0018:ffffac9483153d40 EFLAGS: 00010286 [ 308.072013] RAX: 0000000000000000 RBX: ffff9eb299da0000 RCX: 0000000000000006 [ 308.079689] RDX: 0000000000000000 RSI: ffff9eb29e3508a0 RDI: ffff9eb29e350000 [ 308.087366] RBP: ffff9eb299da0000 R08: 0000000000000000 R09: 0000000000000000 [ 308.095042] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9eb299dbd1f8 [ 308.102717] R13: ffffffffc04f8368 R14: ffff9eb29cebd130 R15: 0000000000000000 [ 308.110394] FS: 00007f06721c9940(0000) GS:ffff9eb2a18c0000(0000) knlGS:0000000000000000 [ 308.119099] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 308.125280] CR2: 0000000000000000 CR3: 000000081d798005 CR4: 00000000001606e0 [ 308.135304] printk: systemd-shutdow: 3 output lines suppressed due to ratelimiting [ 308.143518] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 [ 308.151798] Kernel Offset: 0x15000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xff) [ 308.171775] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 ]--- Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-16 09:58:52 -05:00
Xiaojie Yuan	bfa603aa5e	drm/amdgpu: fix null pointer deref in firmware header printing v2: declare as (struct common_firmware_header *) type because struct xxx_firmware_header inherits from it When CE's ucode_id(8) is used to get sdma_hdr, we will be accessing an unallocated amdgpu_firmware_info instance. This issue appears on rhel7.7 with gcc 4.8.5. Newer compilers might have optimized out such 'defined but not referenced' variable. [ 1120.798564] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a [ 1120.806703] IP: [<ffffffffc0e3c9b3>] psp_np_fw_load+0x1e3/0x390 [amdgpu] [ 1120.813693] PGD 80000002603ff067 PUD 271b8d067 PMD 0 [ 1120.818931] Oops: 0000 [#1] SMP [ 1120.822245] Modules linked in: amdgpu(OE+) amdkcl(OE) amd_iommu_v2 amdttm(OE) amd_sched(OE) xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun bridge stp llc devlink ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc dm_mirror dm_region_hash dm_log dm_mod intel_pmc_core intel_powerclamp coretemp intel_rapl joydev kvm_intel eeepc_wmi asus_wmi kvm sparse_keymap iTCO_wdt irqbypass rfkill crc32_pclmul snd_hda_codec_realtek mxm_wmi ghash_clmulni_intel intel_wmi_thunderbolt iTCO_vendor_support snd_hda_codec_generic snd_hda_codec_hdmi aesni_intel lrw gf128mul glue_helper ablk_helper sg cryptd pcspkr snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd pinctrl_sunrisepoint pinctrl_intel soundcore acpi_pad mei_me wmi mei i2c_i801 pcc_cpufreq ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic i915 i2c_algo_bit iosf_mbi drm_kms_helper e1000e syscopyarea sysfillrect sysimgblt fb_sys_fops ahci libahci drm ptp libata crct10dif_pclmul crct10dif_common crc32c_intel serio_raw pps_core drm_panel_orientation_quirks video i2c_hid [ 1120.954136] CPU: 4 PID: 2426 Comm: modprobe Tainted: G OE ------------ 3.10.0-1062.el7.x86_64 #1 [ 1120.964390] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1302 11/09/2015 [ 1120.973321] task: ffff991ef1e3c1c0 ti: ffff991ee625c000 task.ti: ffff991ee625c000 [ 1120.981020] RIP: 0010:[<ffffffffc0e3c9b3>] [<ffffffffc0e3c9b3>] psp_np_fw_load+0x1e3/0x390 [amdgpu] [ 1120.990483] RSP: 0018:ffff991ee625f950 EFLAGS: 00010202 [ 1120.995935] RAX: 0000000000000002 RBX: ffff991edf6b2d38 RCX: ffff991edf6a0000 [ 1121.003391] RDX: 0000000000000000 RSI: ffff991f01d13898 RDI: ffffffffc110afb3 [ 1121.010706] RBP: ffff991ee625f9b0 R08: 0000000000000000 R09: 0000000000000000 [ 1121.018029] R10: 00000000000004c4 R11: ffff991ee625f64e R12: ffff991edf6b3220 [ 1121.025353] R13: ffff991edf6a0000 R14: 0000000000000008 R15: ffff991edf6b2d30 [ 1121.032666] FS: 00007f97b0c0b740(0000) GS:ffff991f01d00000(0000) knlGS:0000000000000000 [ 1121.041000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1121.046880] CR2: 000000000000000a CR3: 000000025e604000 CR4: 00000000003607e0 [ 1121.054239] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1121.061631] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1121.068938] Call Trace: [ 1121.071494] [<ffffffffc0e3dba8>] psp_hw_init+0x218/0x270 [amdgpu] [ 1121.077886] [<ffffffffc0da3188>] amdgpu_device_fw_loading+0xe8/0x160 [amdgpu] [ 1121.085296] [<ffffffffc0e3b34c>] ? vega10_ih_irq_init+0x4bc/0x730 [amdgpu] [ 1121.092534] [<ffffffffc0da5c75>] amdgpu_device_init+0x1495/0x1c90 [amdgpu] [ 1121.099675] [<ffffffffc0da9cab>] amdgpu_driver_load_kms+0x8b/0x2f0 [amdgpu] [ 1121.106888] [<ffffffffc01b25cf>] drm_dev_register+0x12f/0x1d0 [drm] [ 1121.113419] [<ffffffffa4dcdfd8>] ? pci_enable_device_flags+0xe8/0x140 [ 1121.120183] [<ffffffffc0da260a>] amdgpu_pci_probe+0xca/0x170 [amdgpu] [ 1121.126919] [<ffffffffa4dcf97a>] local_pci_probe+0x4a/0xb0 [ 1121.132622] [<ffffffffa4dd10c9>] pci_device_probe+0x109/0x160 [ 1121.138607] [<ffffffffa4eb4205>] driver_probe_device+0xc5/0x3e0 [ 1121.144766] [<ffffffffa4eb4603>] __driver_attach+0x93/0xa0 [ 1121.150507] [<ffffffffa4eb4570>] ? __device_attach+0x50/0x50 [ 1121.156422] [<ffffffffa4eb1da5>] bus_for_each_dev+0x75/0xc0 [ 1121.162213] [<ffffffffa4eb3b7e>] driver_attach+0x1e/0x20 [ 1121.167771] [<ffffffffa4eb3620>] bus_add_driver+0x200/0x2d0 [ 1121.173590] [<ffffffffa4eb4c94>] driver_register+0x64/0xf0 [ 1121.179345] [<ffffffffa4dd0905>] __pci_register_driver+0xa5/0xc0 [ 1121.185593] [<ffffffffc099f000>] ? 0xffffffffc099efff [ 1121.190914] [<ffffffffc099f0a4>] amdgpu_init+0xa4/0xb0 [amdgpu] [ 1121.197101] [<ffffffffa4a0210a>] do_one_initcall+0xba/0x240 [ 1121.202901] [<ffffffffa4b1c90a>] load_module+0x271a/0x2bb0 [ 1121.208598] [<ffffffffa4dad740>] ? ddebug_proc_write+0x100/0x100 [ 1121.214894] [<ffffffffa4b1ce8f>] SyS_init_module+0xef/0x140 [ 1121.220698] [<ffffffffa518bede>] system_call_fastpath+0x25/0x2a [ 1121.226870] Code: b4 01 60 a2 00 00 31 c0 e8 83 60 33 e4 41 8b 47 08 48 8b 4d d0 48 c7 c7 b3 af 10 c1 48 69 c0 68 07 00 00 48 8b 84 01 60 a2 00 00 <48> 8b 70 08 31 c0 48 89 75 c8 e8 56 60 33 e4 48 8b 4d d0 48 c7 [ 1121.247422] RIP [<ffffffffc0e3c9b3>] psp_np_fw_load+0x1e3/0x390 [amdgpu] [ 1121.254432] RSP <ffff991ee625f950> [ 1121.258017] CR2: 000000000000000a [ 1121.261427] ---[ end trace e98b35387ede75bd ]--- Signed-off-by: Xiaojie Yuan <xiaojie.yuan@amd.com> Fixes: `c5fb912653` ("drm/amdgpu: add firmware header printing for psp fw loading (v2)") Reviewed-by: Kevin Wang <kevin1.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-16 09:56:01 -05:00
Huang Rui	4042a18872	drm/amdkfd: enable renoir while device probes This patch is to add asic flag to enable device probe during kfd init. Signed-off-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-16 09:55:50 -05:00
Huang Rui	aa978594cf	drm/amdgpu: disable gfxoff while use no H/W scheduling policy While gfxoff is enabled, the mmVM_XXX registers will be 0xfffffff while the GFX is in "off" state. KFD queue creattion doesn't use ring based method, so it will trigger a VM fault. Signed-off-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-16 09:55:41 -05:00
Felix Kuehling	7cae706193	drm/amdgpu: Disable retry faults in VMID0 There is no point retrying page faults in VMID0. Those faults are always fatal. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-and-Tested-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-16 09:54:34 -05:00
Yong Zhao	4e66d7d215	drm/amdgpu: Add a kernel parameter for specifying the asic type As more and more new asics start to reuse the old device IDs before launch, there is a need to quickly override the existing asic type corresponding to the reused device ID through a kernel parameter. With this, engineers no longer need to rely on local hack patches, facilitating cooperation across teams. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-16 09:54:25 -05:00
Alex Deucher	bb42eda284	drm/amdgpu/irq: check if nbio funcs exist We need to check if the nbios funcs exist before checking the individual pointers. Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>	2019-09-16 09:54:14 -05:00
Tao Zhou	1a6fc071e1	drm/amdgpu: move the call of ras recovery_init and bad page reserve to proper place ras recovery_init should be called after ttm init, bad page reserve should be put in front of gpu reset since i2c may be unstable during gpu reset. add cleanup for recovery_init and recovery_fini v2: add more comment and print. remove cancel_work_sync in recovery_init. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:50:47 -05:00
Tao Zhou	87d2b92f1e	drm/amdgpu: save umc error records save umc error records to ras bad page array v2: add bad pages before gpu reset v3: add NULL check for adev->umc.funcs Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:50:40 -05:00
Tao Zhou	78ad00c903	drm/amdgpu: Hook EEPROM table to RAS support eeprom records load and save for ras, move EEPROM records storing to bad page reserving v2: remove redundant check for con->eh_data Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:50:32 -05:00
Tao Zhou	9dc23a6325	drm/amdgpu: change ras bps type to eeprom table record structure change bps type from retired page to eeprom table record, prepare for saving umc error records to eeprom Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:50:26 -05:00
Andrey Grodzovsky	4bc2234077	drm/madgpu: Fix EEPROM Checksum calculation. Fix typo which messed up the calculation. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-and-tested-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:50:19 -05:00
Andrey Grodzovsky	4d25fba4e3	drm/amdgpu: Remove clock gating restore. Restoring clock gating break SMU opeartion afterwards, avoid this until this further invistigated with SMU. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-and-tested-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:50:10 -05:00
Yong Zhao	050091ab6e	drm/amdkfd: Query kfd device info by CHIP id instead of pci device id This optimizes out the pci device id usage in KFD and makes the code more maintainable. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:49:45 -05:00
Felix Kuehling	cd05c86510	drm/amdgpu: Disable page faults while reading user wptrs These wptrs must be pinned and GPU accessible when this is called from hqd_load functions. So they should never fault. This resolves a circular lock dependency issue involving four locks including the DQM lock and mmap_sem. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Oak Zeng <Oak.Zeng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:49:38 -05:00
John Clements	337c200756	drm/amdgpu: clean up load TMR sequence Removed redundant goto statement Signed-off-by: John Clements <john.clements@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:49:11 -05:00
John Clements	4fb60b02fb	drm/amdgpu: enable TA load support in Arcturus Add support for loading XGMI/RAS FW Signed-off-by: John Clements <john.clements@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:49:02 -05:00
Tao Zhou	c5b6e585b2	drm/amdgpu: change r type to int in gmc_v9_0_late_init change r type from bool to int, suitable for both bool and int return value Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:48:51 -05:00
Jack Zhang	f1d59e00ff	drm/amd/amdgpu: add sw_fini interface for df_funcs add sw_fini interface of df_funcs. This interface will remove sysfs file of df_cntr_avail function. The old behavior only create sysfs of df_cntr_avail in sw_init, but never remove it for lack of sw_fini interface. With this,driver will report create sysfs fail when it's loaded for the second time. Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com> Reviewed-by: Jonathan Kim <Jonathan.Kim@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:42:15 -05:00
Hawking Zhang	9dc9134258	drm/amdgpu: init UMC & RSMU register base address UMC RAS feature requires access to UMC & RSMU registers Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:42:08 -05:00
Hawking Zhang	1c70d3d9c4	drm/amdgpu/nbio: switch to amdgpu_nbio_ras_late_init helper function amdgpu_nbio_ras_late_init is used to init nbio specfic ras debugfs/sysfs node and nbio specific interrupt handler. It can be shared among nbio generations Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:42:02 -05:00
Hawking Zhang	47930de4aa	drm/amdgpu/mmhub: switch to amdgpu_mmhub_ras_late_init helper function amdgpu_mmhub_ras_late_init is used to init mmhub specfic ras debugfs/sysfs node and mmhub specific interrupt handler. It can be shared among mmhub generations Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:41:55 -05:00
Hawking Zhang	bfcf62c2a5	drm/amdgpu/sdma: switch to amdgpu_sdma_ras_late_init helper function amdgpu_sdma_ras_late_init is used to init sdma specfic ras debugfs/sysfs node and sdma specific interrupt handler. It can be shared among sdma generations Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:41:49 -05:00
Hawking Zhang	6caeee7a70	drm/amdgpu/gfx: switch to amdgpu_gfx_ras_late_init helper function amdgpu_gfx_ras_late_init is used to init gfx specfic ras debugfs/sysfs node and gfx specific interrupt handler. It can be shared among gfx generations Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:41:42 -05:00
Hawking Zhang	a85eff14da	drm/amdgpu/gmc: switch to amdgpu_gmc_ras_late_init helper function amdgpu_gmc_ras_late_init is used to init gmc specfic ras debugfs/sysfs node and gmc specific interrupt handler. It can be shared among gmc generations. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:41:36 -05:00
Hawking Zhang	d094aea312	drm/amdgpu: set ip specific ras interface pointer to NULL after free it to prevent access to dangling pointers Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:41:29 -05:00
Andrey Grodzovsky	d5ea093eeb	dmr/amdgpu: Add system auto reboot to RAS. In case of RAS error allow user configure auto system reboot through ras_ctrl. This is also part of the temproray work around for the RAS hang problem. v4: Use latest kernel API for disk sync. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:41:17 -05:00
Andrey Grodzovsky	7c6e68c777	drm/amdgpu: Avoid HW GPU reset for RAS. Problem: Under certain conditions, when some IP bocks take a RAS error, we can get into a situation where a GPU reset is not possible due to issues in RAS in SMU/PSP. Temporary fix until proper solution in PSP/SMU is ready: When uncorrectable error happens the DF will unconditionally broadcast error event packets to all its clients/slave upon receiving fatal error event and freeze all its outbound queues, err_event_athub interrupt will be triggered. In such case and we use this interrupt to issue GPU reset. THe GPU reset code is modified for such case to avoid HW reset, only stops schedulers, deatches all in progress and not yet scheduled job's fences, set error code on them and signals. Also reject any new incoming job submissions from user space. All this is done to notify the applications of the problem. v2: Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c Remove print param from amdgpu_ras_query_error_count v3: Update based on prevoius bug fixing patch to properly call amdgpu_amdkfd_pre_reset for other XGMI hive memebers. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:41:05 -05:00
Andrey Grodzovsky	12ffa55da6	drm/amdgpu: Fix bugs in amdgpu_device_gpu_recover in XGMI case. Issue 1: In XGMI case amdgpu_device_lock_adev for other devices in hive was called to late, after access to their repsective schedulers. So relocate the lock to the begining of accessing the other devs. Issue 2: Using amdgpu_device_ip_need_full_reset to switch the device list from all devices in hive to the single 'master' device who owns this reset call is wrong because when stopping schedulers we iterate all the devices in hive but when restarting we will only reactivate the 'master' device. Also, in case amdgpu_device_pre_asic_reset conlcudes that full reset IS needed we then have to stop schedulers for all devices in hive and not only the 'master' but with amdgpu_device_ip_need_full_reset we already missed the opprotunity do to so. So just remove this logic and always stop and start all schedulers for all devices in hive. Also minor cleanup and print fix. v4: Minor coding style fix. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:39:05 -05:00
Christian König	43ce6bab7b	drm/amdgpu: remove amdgpu_cs_try_evict Trying to evict things from the current working set doesn't work that well anymore because of per VM BOs. Rely on reserving VRAM for page tables to avoid contention. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Chunming Zhou <david1.zhou@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:38:56 -05:00
Christian König	9d1b3c7805	drm/amdgpu: reserve at least 4MB of VRAM for page tables v2 This hopefully helps reduce the contention for page tables. v2: adjust maximum reported VRAM size as well Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Chunming Zhou <david1.zhou@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:38:47 -05:00
Christian König	629be20395	drm/amdgpu: use moving fence instead of exclusive for VM updates Make VM updates depend on the moving fence instead of the exclusive one. Makes it less likely to actually have a dependency. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Chunming Zhou <david1.zhou@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:38:38 -05:00
Hawking Zhang	39857252e5	drm/amdgpu: only apply gds clearing workaround when ras is supported gds clearing workaround should only be applied on asics that support gfx ras Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:36:29 -05:00
Hawking Zhang	8bf2485aec	drm/amdgpu: fix memory leak when ras is not supported on specific ip block free ras_if if ras is not supported Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:36:22 -05:00
Hawking Zhang	4ce71be67b	drm/amdgpu: check mmhub_funcs pointer before refering to it mmhub callback functions are not initialized for all the ASICs Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:36:16 -05:00
Felix Kuehling	17da41bf00	drm/amdgpu: Remove unnecessary TLB workaround (v2) This workaround is better handled in user mode in a way that doesn't require allocating extra memory and breaking userptr BOs. The TLB bug is a performance bug, not a functional or security bug. Hence it is safe to remove this kernel part of the workaround to allow a better workaround using only virtual address alignments in user mode. v2: Removed VI_BO_SIZE_ALIGN definition Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:36:08 -05:00
Felix Kuehling	e0253d083c	drm/amdgpu: Use optimal mtypes and PTE bits for Arcturus For compute VRAM allocations on Arturus use the new RW mtype for non-coherent local memory, CC mtype for coherent local memory and PTE_SNOOPED bit for invalidating non-dirty cache lines on remote XGMI mappings. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Tested-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Shaoyun Liu <Shaoyun.Liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:36:01 -05:00
Felix Kuehling	d0ba51b1ca	drm/amdgpu: Determing PTE flags separately for each mapping (v3) The same BO can be mapped with different PTE flags by different GPUs. Therefore determine the PTE flags separately for each mapping instead of storing them in the KFD buffer object. Add a helper function to determine the PTE flags to be extended with ASIC and memory-type-specific logic in subsequent commits. v2: Split Arcturus-specific MTYPE changes into separate commit v3: Fix return type of get_pte_flags to uint64_t Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Shaoyun Liu <Shaoyun.Liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:35:55 -05:00
Oak Zeng	093e48c04d	drm/amdgpu: Support new arcturus mtype Arcturus repurposed mtype WC to RW. Modify gmc functions to support the new mtype Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Shaoyun Liu <Shaoyun.Liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:35:48 -05:00
Hawking Zhang	22e1d14fef	drm/amdgpu: switch to amdgpu_ras_late_init for nbio v7_4 (v2) call helper function in late init phase to handle ras init for nbio ip block v2: init local var r to 0 in case the function return failure on asics that don't have ras_late_init implementation Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:11:05 -05:00
Hawking Zhang	9ad1dc295b	drm/amdgpu: add ras_late_init callback function for nbio v7_4 (v3) ras_late_init callback function will be used to do common ras init in late init phase. v2: call ras_late_fini to do cleanup when fails to enable interrupt v3: rename sysfs/debugfs node name to pcie_bif_xxx Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:11:05 -05:00
Hawking Zhang	dda79907a7	drm/amdgpu: add mmhub ras_late_init callback function (v2) The function will be called in late init phase to do mmhub ras init v2: check ras_late_init function pointer before invoking the function Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:11:05 -05:00
Hawking Zhang	2452e7783c	drm/amdgpu: switch to amdgpu_ras_late_init for gmc v9 block (v2) call helper function in late init phase to handle ras init for gmc ip block v2: call ras_late_fini to do clean up when fail to enable interrupt Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:11:05 -05:00
Hawking Zhang	7d0a31e8cc	drm/amdgpu: switch to amdgpu_ras_late_init for sdma v4 block (v2) call helper function in late init phase to handle ras init for sdma ip block v2: call ras_late_fini to do clean up when fail to enable interrupt Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:11:04 -05:00
Hawking Zhang	63fa48db49	drm/amdgpu: switch to amdgpu_ras_late_init for gfx v9 block (v2) call helper function in late init phase to handle ras init for gfx ip block v2: call ras_late_fini to do clean up when fail to enable interrupt Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:11:04 -05:00
Hawking Zhang	b293e891b0	drm/amdgpu: add helper function to do common ras_late_init/fini (v3) In late_init for ras, the helper function will be used to 1). disable ras feature if the IP block is masked as disabled 2). send enable feature command if the ip block was masked as enabled 3). create debugfs/sysfs node per IP block 4). register interrupt handler v2: check ih_info.cb to decide add interrupt handler or not v3: add ras_late_fini for cleanup all the ras fs node and remove interrupt handler Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:11:04 -05:00
Hawking Zhang	a344db8e5e	drm/amdgpu: poll ras_controller_irq and err_event_athub_irq status For the hardware that can not enable BIF ring for IH cookies for both ras_controller_irq and err_event_athub_irq, the driver has to poll the status register in irq handling and ack the hardware properly when there is interrupt triggered Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>	2019-09-13 17:11:04 -05:00

1 2 3 4 5 ...

6212 commits