User can specify injected instances by the mask. For backward
compatibility, the mask value is incorporated into sub block index
without interface change of RAS TA.
User uses logical mask and driver should convert it to physical value
before sending it to RAS TA.
v2: update parameter name.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add common helper to reset ras error status. It
applies to IP blocks that follow the new ras error
logging register design, and need to write 0 to
reset the error status. For IP blocks that don't
support the new design, please still implement ip
specific helper.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add common helper to query ras error status and
log error information, including memory block id
and erorr count. The helpers are applicable to IP
blocks that follow the new ras error logging design.
For IP blocks that don't support the new design,
please still implement ip specific helper to query
ras error.
v2: optimize struct amdgpu_ras_err_status_reg_entry
and the implementaion in helper (Lijo/Tao)
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Aldebaran supports VCN and JPEG RAS, it reports unexpected
block id message during VCN and JPEG RAS initialization if VCN
and JPEG block id not defined.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If a UMC bad page is reserved but not freed by an application, the
application may trigger uncorrectable error repeatly by accessing the page.
v2: add specific function to do the check.
v3: remove duplicate pages, calculate new added bad page number.
v4: reuse save_bad_pages to calculate new added bad page number.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
amdgpu_ras_block_late_init will be invoked in IP
specific ras_late_init call as a common helper for
all the IP blocks.
However, when amdgpu_ras_block_late_init call
amdgpu_ras_query_error_count to query ras error
counters, amdgpu_ras_query_error_count queries
all the IP blocks that support ras query interface.
This results to wrong error counters cached in
software copies when there are ras errors detected
at time zero or warm reset procedure. i.e., in
sdma_ras_late_init phase, it counts on sdma/mmhub
errors, while, in mmhub_ras_late_init phase, it
still counts on sdma/mmhub errors.
The change updates amdgpu_ras_query_error_count
interface to allow query specific ip error counter.
It introduces a new input parameter: query_info. if
query_info is NULL, it means query all the IP blocks,
otherwise, only query the ip block specified by
query_info.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
SMU add a new variable mca_ceumc_addr to record
umc correctable error address in EccInfo table,
driver side add EccInfo_V2_t to support this feature
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The fatal error handler is independent from general ras interrupt
handler since there is no related IH ring.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add support for general RAS poison consumption handler.
v2: remove callback function for poison consumption.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add vcn and jpeg ras support options
V2: vcn and jpeg ras flag enabled for aldebaran asic only
V3: vcn and jpeg ras flag disabled for error counter query
Generic poison query interface added
VCN and JPEG ras enabled based on IP version check
V4: vcn and jpeg ras flag moved under ecc flag for dGPU
Signed-off-by: Mohammad Zafar Ziya <Mohammadzafar.ziya@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
It should notice SMU to update bad channel info when detected
uncorrectable error in UMC block
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Modify .ras_fini function pointer parameter so that
we can remove redundant intermediate calls in some
ras blocks.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Define amdgpu_ras_late_init to call all ras blocks' .ras_late_init.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Modify .ras_late_init function pointer parameter so that
it can remove redundant intermediate calls in some ras blocks.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1. Merge amdgpu_ras_late_init to
amdgpu_ras_block_late_init.
2. Remove amdgpu_ras_late_init since no ras block
calls amdgpu_ras_late_init.
3. Merge amdgpu_ras_late_fini to
amdgpu_ras_block_late_fini.
4. Remove amdgpu_ras_late_fini since no ras block
calls amdgpu_ras_late_fini.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In order to reduce redundant struct conversion, modify
operating sysfs and interrupt function interface parameters.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1. Define amdgpu_ras_block_late_init to create sysfs nodes
and interrupt handles.
2. Define amdgpu_ras_block_late_fini to remove sysfs nodes
and interrupt handles.
3. Replace ras block variable members in struct
amdgpu_ras_block_object with struct ras_common_if, which
can make it easy to associate each ras block instance
with each ras block functional interface.
4. Add .ras_cb to struct amdgpu_ras_block_object.
5. Change each ras block to fit for the changement of struct
amdgpu_ras_block_object.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fix the code style warnings in amdgpu_ras:
1. ERROR: space required before the open parenthesis '('.
2. WARNING: line length of xxx exceeds 100 columns.
3. ERROR: "foo* bar" should be "foo *bar".
4. WARNING: unnecessary whitespace before a quoted newline.
5. WARNING: space prohibited before semicolon.
6. WARNING: suspect code indent for conditional statements.
7. WARNING: braces {} are not necessary for single statement blocks.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Modify the compilation failed problem when other ras blocks' .h include amdgpu_ras.h.
v2: squash in forward declaration warning fix (Alex)
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1. Define unified ops interface for each block.
2. Add ras_block_match function pointer in ops interface, each ras block can customize specail match function to identify itself.
3. Add amdgpu_ras_block_match_default new function. If a ras block doesn't define .ras_block_match, default execute amdgpu_ras_block_match_default to identify this ras block.
4. Define unified basic ras block data for each ras block.
5. Create dedicated amdgpu device ras block link list to manage all of the ras blocks.
6. Add amdgpu_ras_register_ras_block new function interface for each ras block to register itself to ras controlling block.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add RAS poison supported flag and tell PSP RAS TA about the info.
v2: rename poison mode to poison supported, we can also disable poison
mode even we support it.
print value of poison supported if ras feature enablement fails.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
All code paths under the EAGAIN path in RAS late init are unused.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Update RAS infrastructure to support RAS query for MCA subblocks
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add MPIO to RAS block
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add MCA specific IP blocks targetting RAS features
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Adding name field back to ras_common_if to work around error
injection failure with amdgpuras tool.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Delete ras_if->name in the RAS ctx structure and remove related lines.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In amdgpu_ras_query_error_count() return an error
if the device doesn't support RAS. This prevents
that function from having to always set the values
of the integer pointers (if set), and thus
prevents function side effects--always to have to
set values of integers if integer pointers set,
regardless of whether RAS is supported or
not--with this change this side effect is
mitigated.
Also, if no pointers are set, don't count, since
we've no way of reporting the counts.
Also, give this function a kernel-doc.
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Reported-by: Tom Rix <trix@redhat.com>
Fixes: a46751fbcd ("drm/amdgpu: Fix RAS function interface")
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
On Context Query2 IOCTL return the correctable and
uncorrectable errors in O(1) fashion, from cached
values, and schedule a delayed work function to
calculate and cache them for the next such IOCTL.
v2: Cancel pending delayed work at ras_fini().
v3: Remove conditionals when dealing with delayed
work manipulation as they're inherently racy.
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The correctable and uncorrectable errors
are calculated at each invocation of this
function. Therefore, it is highly inefficient to
return just one of them based on a Boolean
input. If the caller wants both, twice the work
would be done. (And this work is O(n^3) on
Vega20.)
Fix this "interface" to simply return what it had
calculated--both values. Let the caller choose
what it wants to record, inspect, use.
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Only clear RAS error counters if perestent EDC harvesting is not supported
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Rename,
ras_hw_supported --> ras_hw_enabled, and
ras_features --> ras_enabled,
to show that ras_enabled is a subset of
ras_hw_enabled, which itself is a subset
of the ASIC capability.
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Move ras_hw_supported into struct amdgpu_dev.
The dependency is:
struct amdgpu_ras <== struct amdgpu_dev <== ASIC,
read as "struct amdgpu_ras depends on struct
amdgpu_dev, which depends on the hardware."
This can be loosely understood as, "if RAS is
supported, which is property of the ASIC (struct
amdgpu_dev), then we can access struct
amdgpu_ras."
v2: Fix a typo: must binary AND in ternary cond
in amdgpu_ras.c
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Remove redundant ras->supported, as this value
is also stored in adev->ras_features.
Use adev->ras_features, as that supercedes "ras",
since the latter is its member.
The dependency goes like this:
ras <== adev->ras_features <== hw_supported,
and is read as "ras depends on ras_features, which
depends on hw_supported." The arrows show the flow
of information, i.e. the dependency update.
"hw_supported" should also live in "adev".
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
cause:
It is necessary to send ras disable command to ras-ta during gfx
block ras later init, because the ras capability is disable read
from vbios for vega20 gaming, but the ras context is released
during ras init process, this will cause send ras disable command
to ras-to failed.
how:
Delay releasing ras context, the ras context
will be released after gfx block later init done.
Changed from V1:
move release_ras_context into ras_resume
Changed from V2:
check BIT(UMC) is more reasonable before access eeprom table
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
When connected to a host via xGMI, system fatal errors may trigger
warm reset, driver has no change to query edc status before reset.
Therefore in this case, driver should harvest previous error loging
registers during boot, instead of only resetting them.
v2:
1. IP's ras_manager object is created when its ras feature is enabled,
so change to query edc status after amdgpu_ras_late_init called
2. change to enable watchdog timer after finishing gfx edc init
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reivewed-by: Hawking Zhang <hawking.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If the number of badpage records exceed the threshold, driver has
updated both epprom header and control->tbl_hdr.header before gpu reset,
therefore GPU recovery thread no need to read epprom header directly.
v2: merge amdgpu_ras_check_err_threshold into amdgpu_ras_eeprom_check_err_threshold
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cleanup unnecessary debugfs dentries and surrounding functions.
v3: remove return value check for debugfs_create_file()
v2: remove ttm_debugfs_entries array.
do not init variables.
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
There is still a warning when CONFIG_DEBUG_FS is disabled:
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1145:13: error: 'amdgpu_ras_debugfs_create_ctrl_node' defined but not used [-Werror=unused-function]
1145 | static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)
Change the code again to make the compiler actually drop
this code but not warn about it.
Fixes: ae2bf61ff3 ("drm/amdgpu: guard ras debugfs creation/removal based on CONFIG_DEBUG_FS")
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In amdgpu_ras_reset_gpu, because bad pages may not be freed,
it has high probability to reserve bad pages failed.
Change to reserve bad pages when freeing VRAM.
v2:
1. avoid allocating the drm_mm node outside of amdgpu_vram_mgr.c
2. move bad page reserving into amdgpu_ras_add_bad_pages, if vram mgr
reserve bad page failed, it will put it into pending list, otherwise
put it into processed list;
3. remove amdgpu_ras_release_bad_pages, because retired page's info has
been moved into amdgpu_vram_mgr
v3:
1. formate code style;
2. rename amdgpu_vram_reserve_scope as amdgpu_vram_reservation;
3. rename scope_pending as reservations_pending;
4. rename scope_processed as reserved_pages;
5. change to iterate over all the pending ones and try to insert them
with drm_mm_reserve_node();
v4:
1. rename amdgpu_vram_mgr_reserve_scope as
amdgpu_vram_mgr_reserve_range;
2. remove unused include "amdgpu_ras.h";
3. rename amdgpu_vram_mgr_check_and_reserve as
amdgpu_vram_mgr_do_reserve;
4. refine amdgpu_vram_mgr_reserve_range to call
amdgpu_vram_mgr_do_reserve.
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <hawking.zhang@amd.com>
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Wenhui Sheng <Wenhui.Sheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Because bad pages saving has been moved to UMC error interrupt callback,
which will trigger a new GPU reset after saving.
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Hawking Zhang <hawking.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Instead of saving bad pages in amdgpu_ras_reset_gpu, it will reduce
the unnecessary calling of amdgpu_ras_save_bad_pages.
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Hawking Zhang <hawking.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Once ras recovery is issued by ras sync flood interrupt or
ras controller interrupt, add this guard to bypass or execute
ras error count register harvest of all IPs.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>