linux/drivers/net/ethernet/broadcom
David Christensen 6c4ca03bd8 net/tg3: resolve deadlock in tg3_reset_task() during EEH
During EEH error injection testing, a deadlock was encountered in the tg3
driver when tg3_io_error_detected() was attempting to cancel outstanding
reset tasks:

crash> foreach UN bt
...
PID: 159    TASK: c0000000067c6000  CPU: 8   COMMAND: "eehd"
...
 #5 [c00000000681f990] __cancel_work_timer at c00000000019fd18
 #6 [c00000000681fa30] tg3_io_error_detected at c00800000295f098 [tg3]
 #7 [c00000000681faf0] eeh_report_error at c00000000004e25c
...

PID: 290    TASK: c000000036e5f800  CPU: 6   COMMAND: "kworker/6:1"
...
 #4 [c00000003721fbc0] rtnl_lock at c000000000c940d8
 #5 [c00000003721fbe0] tg3_reset_task at c008000002969358 [tg3]
 #6 [c00000003721fc60] process_one_work at c00000000019e5c4
...

PID: 296    TASK: c000000037a65800  CPU: 21  COMMAND: "kworker/21:1"
...
 #4 [c000000037247bc0] rtnl_lock at c000000000c940d8
 #5 [c000000037247be0] tg3_reset_task at c008000002969358 [tg3]
 #6 [c000000037247c60] process_one_work at c00000000019e5c4
...

PID: 655    TASK: c000000036f49000  CPU: 16  COMMAND: "kworker/16:2"
...
 #4 [c0000000373ebbc0] rtnl_lock at c000000000c940d8
 #5 [c0000000373ebbe0] tg3_reset_task at c008000002969358 [tg3]
 #6 [c0000000373ebc60] process_one_work at c00000000019e5c4
...

Code inspection shows that both tg3_io_error_detected() and
tg3_reset_task() attempt to acquire the RTNL lock at the beginning of
their code blocks.  If tg3_reset_task() should happen to execute between
the times when tg3_io_error_detected() acquires the RTNL lock and
tg3_reset_task_cancel() is called, a deadlock will occur.
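
The ordering problem described above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions, not the
driver code: a pthread mutex stands in for the RTNL lock, a worker thread
stands in for tg3_reset_task(), and pthread_join() stands in for the
synchronous cancel (unlike a real work cancel, it can only wait for the
worker, not dequeue it). All model_* names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical user-space stand-ins; none of these are kernel APIs. */
static pthread_mutex_t model_rtnl = PTHREAD_MUTEX_INITIALIZER; /* plays RTNL */
static int model_resets_done;

/* Plays tg3_reset_task(): takes the lock at the start of its body. */
static void *model_reset_worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&model_rtnl);
	model_resets_done++;
	pthread_mutex_unlock(&model_rtnl);
	return NULL;
}

/*
 * Plays the error handler with the reordered sequence: wait for the
 * worker BEFORE taking the lock. Returns 0 on success, -1 on failure.
 *
 * The deadlocking order would be: lock model_rtnl, then join the worker.
 * The worker would block on model_rtnl, and the join would never return.
 */
static int model_error_detected(void)
{
	pthread_t worker;

	if (pthread_create(&worker, NULL, model_reset_worker, NULL) != 0)
		return -1;

	/* Safe: no lock the worker needs is held while waiting for it. */
	if (pthread_join(worker, NULL) != 0)
		return -1;

	pthread_mutex_lock(&model_rtnl);
	/* ... error handling that needs the lock would go here ... */
	pthread_mutex_unlock(&model_rtnl);
	return 0;
}
```

Calling model_error_detected() from a test program linked with -pthread
returns 0 and leaves model_resets_done at 1. Swapping the join and the
lock acquisition reproduces the hang in this model.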

Moving the tg3_reset_task_cancel() call earlier within the code block, prior
to acquiring RTNL, prevents this from happening, but also exposes another
deadlock issue where tg3_reset_task() may execute AFTER
tg3_io_error_detected() has executed:

crash> foreach UN bt
PID: 159    TASK: c0000000067d2000  CPU: 9   COMMAND: "eehd"
...
 #4 [c000000006867a60] rtnl_lock at c000000000c940d8
 #5 [c000000006867a80] tg3_io_slot_reset at c0080000026c2ea8 [tg3]
 #6 [c000000006867b00] eeh_report_reset at c00000000004de88
...
PID: 363    TASK: c000000037564000  CPU: 6   COMMAND: "kworker/6:1"
...
 #3 [c000000036c1bb70] msleep at c000000000259e6c
 #4 [c000000036c1bba0] napi_disable at c000000000c6b848
 #5 [c000000036c1bbe0] tg3_reset_task at c0080000026d942c [tg3]
 #6 [c000000036c1bc60] process_one_work at c00000000019e5c4
...

This issue can be avoided by aborting tg3_reset_task() if EEH error
recovery is already in progress.
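
The early-abort idea can be sketched as a guard at the top of the reset
path. This is a minimal model, not the driver's implementation: struct
model_dev, its recovery_in_progress flag, and model_try_reset() are
hypothetical names chosen for illustration, and the real driver performs
its check while holding its own locks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device state; field names are illustrative only. */
struct model_dev {
	bool recovery_in_progress; /* set while error recovery is running */
	bool running;              /* interface is up */
	int resets;                /* number of resets actually performed */
};

/*
 * Plays the body of the reset task. Returns true if a reset was
 * performed, false if it was skipped because recovery is already in
 * progress or the interface is down.
 */
static bool model_try_reset(struct model_dev *dev)
{
	if (dev->recovery_in_progress || !dev->running)
		return false; /* abort: the recovery path owns the device */

	dev->resets++;
	return true;
}
```

With recovery_in_progress set, model_try_reset() returns false and leaves
the reset counter untouched, so a late-running reset task cannot collide
with the recovery path. Once the flag is cleared, resets proceed normally.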

Fixes: db84bf43ef ("tg3: tg3_reset_task() needs to use rtnl_lock to synchronize")
Signed-off-by: David Christensen <drc@linux.vnet.ibm.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Link: https://lore.kernel.org/r/20230124185339.225806-1-drc@linux.vnet.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-25 22:35:42 -08:00
..
bnx2x Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-11-29 13:04:52 -08:00
bnxt bnxt: Do not read past the end of test names 2023-01-20 12:52:29 +00:00
genet net: bcmgenet: Remove the unused function 2022-12-09 19:46:52 -08:00
Kconfig net: broadcom: Add PTP_1588_CLOCK_OPTIONAL dependency for BCMGENET under ARCH_BCM2835 2022-11-30 20:37:03 -08:00
Makefile net: ethernet: bgmac: Remove -Warray-bounds exception 2022-10-07 08:50:07 +01:00
b44.c net: Remove the obsolte u64_stats_fetch_*_irq() users (drivers). 2022-10-28 20:13:54 -07:00
b44.h
bcm63xx_enet.c net: ethernet: broadcom: bcm63xx_enet: Drop empty platform remove function 2022-12-30 07:28:49 +00:00
bcm63xx_enet.h
bcm4908_enet.c net: broadcom: bcm4908_enet: report queued and transmitted bytes 2022-11-02 20:38:04 -07:00
bcm4908_enet.h
bcmsysport.c net: systemport: Add support for RDMA overflow statistic counter 2022-10-31 20:05:03 -07:00
bcmsysport.h net: systemport: Add support for RDMA overflow statistic counter 2022-10-31 20:05:03 -07:00
bgmac-bcma-mdio.c net: ethernet: bgmac: Fix refcount leak in bcma_mdio_mii_register 2022-06-06 14:38:15 -07:00
bgmac-bcma.c net: bgmac: Fix an erroneous kfree() in bgmac_remove() 2022-06-14 19:16:36 -07:00
bgmac-platform.c Revert "net: ethernet: bgmac: Use devm_platform_ioremap_resource_byname" 2022-02-17 08:45:34 -08:00
bgmac.c net: bgmac: Drop free_netdev() from bgmac_enet_remove() 2022-11-11 19:48:35 -08:00
bgmac.h net: bgmac: remove a copy of the NAPI_POLL_WEIGHT define 2022-04-29 11:56:41 +01:00
bnx2.c skbuff: Introduce slab_build_skb() 2022-12-09 19:47:41 -08:00
bnx2.h
bnx2_fw.h
cnic.c dma-mapping updates for Linux 6.2 2022-12-13 09:05:19 -08:00
cnic.h
cnic_defs.h
cnic_if.h
sb1250-mac.c eth: switch to netif_napi_add_weight() 2022-05-08 11:33:57 +01:00
tg3.c net/tg3: resolve deadlock in tg3_reset_task() during EEH 2023-01-25 22:35:42 -08:00
tg3.h
unimac.h