The Monitor Blackout Problem on Linux

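Background

  • When I run machine learning workloads on my local machine, the monitor goes black roughly 10% of the time
  • I suspect the GPU load is the trigger; these are my notes from hunting for the root cause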

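Symptoms and setup

  • The monitor goes black while a machine learning job is running
  • The load on the RTX 3090 is around GPU 57% / 252 W / fan 100% / 69°C, with roughly 4.8 GB of VRAM in use
  • The load itself should not be especially heavy, yet the monitor occasionally drops
  • My initial guess: the GPU driver resets for a moment and takes the monitor down with it
  • The PC is an OMEN 45L
  • The PSU is a Cooler Master 800W 80 Plus Gold ATX unit, so power capacity should be sufficient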

NVTOP while the monitor is still working

Investigating over SSH

nvidia-smi

When I connected over SSH, the GPU was no longer visible to the driver.

nvidia-smi

Unable to determine the device handle for GPU0000:01:00.0: Unknown Error

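This state corresponds to Xid 79, which shows up in the kernel log below: apparently the driver tried to access the GPU over PCI Express and the GPU was unreachable.

Checking the logs

First, the kernel ring buffer.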

dmesg -T | grep -Ei 'NVRM|Xid|fallen|pcie|aer|nvidia|drm'

[Sun May 24 10:40:36 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622212!
[Sun May 24 10:40:36 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622213!
[Sun May 24 10:40:37 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622214!
[Sun May 24 10:40:37 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622215!
[Sun May 24 10:40:37 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622216!
[Sun May 24 10:40:38 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622217!
[Sun May 24 10:40:38 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622218!
[Sun May 24 10:40:38 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622219!
[Sun May 24 10:40:39 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622220!
[Sun May 24 10:40:39 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622221!
[Sun May 24 10:40:39 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622222!
[Sun May 24 10:40:40 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622223!
[Sun May 24 10:40:40 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622224!
[Sun May 24 10:40:40 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622225!
[Sun May 24 10:40:41 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622226!
[Sun May 24 10:40:41 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622227!
[Sun May 24 10:40:41 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622228!
[Sun May 24 10:40:42 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622229!
[Sun May 24 10:40:42 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622230!
[Sun May 24 10:40:42 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622231!
[Sun May 24 10:40:43 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622232!
[Sun May 24 10:40:43 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622233!
[Sun May 24 10:40:43 2026] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 76 sequence 622234!
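The output above is the same failure repeated many times, so it can help to condense it before reading further. The following is a minimal sketch, assuming the `dmesg -T` lines are available as a list of strings; `summarize_rpc_failures` is a hypothetical helper name, not part of any NVIDIA tool, and the regular expression only covers the exact message format seen in this log.

```python
import re
from collections import Counter

# Matches the NVRM RPC failure lines in the form seen in this log.
RPC_FAIL = re.compile(
    r"NVRM: _issueRpcAndWait: rpcSendMessage failed with status "
    r"(0x[0-9a-fA-F]+) for fn (\d+) sequence (\d+)"
)


def summarize_rpc_failures(lines):
    """Return (count, status histogram, first sequence, last sequence)."""
    statuses = Counter()
    sequences = []
    for line in lines:
        match = RPC_FAIL.search(line)
        if match:
            statuses[match.group(1)] += 1
            sequences.append(int(match.group(3)))
    if not sequences:
        return 0, {}, None, None
    return len(sequences), dict(statuses), min(sequences), max(sequences)


# Two lines copied from the dmesg output above.
sample_dmesg = [
    "[Sun May 24 10:40:36 2026] NVRM: _issueRpcAndWait: rpcSendMessage "
    "failed with status 0x0000000f for fn 76 sequence 622212!",
    "[Sun May 24 10:40:36 2026] NVRM: _issueRpcAndWait: rpcSendMessage "
    "failed with status 0x0000000f for fn 76 sequence 622213!",
]
print(summarize_rpc_failures(sample_dmesg))
```

A single status code with a steadily increasing sequence number suggests one ongoing failure rather than several distinct ones.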

Next, the kernel messages in the systemd journal (journalctl).

journalctl -k -b --no-pager | grep -Ei 'NVRM|Xid|fallen|pcie|aer|nvidia|drm'

May 22 11:54:22 omen kernel: ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
May 22 11:54:22 omen kernel: acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug SHPCHotplug PME AER PCIeCapability LTR DPC]
May 22 11:54:22 omen kernel: pci 0000:00:01.0: [8086:460d] type 01 class 0x060400 PCIe Root Port
May 22 11:54:22 omen kernel: pci 0000:00:0e.0: [8086:467f] type 00 class 0x010400 PCIe Root Complex Integrated Endpoint
May 22 11:54:22 omen kernel: pci 0000:00:14.3: [8086:7af0] type 00 class 0x028000 PCIe Root Complex Integrated Endpoint
May 22 11:54:22 omen kernel: pci 0000:00:1c.0: [8086:7abf] type 01 class 0x060400 PCIe Root Port
May 22 11:54:22 omen kernel: pci 0000:01:00.0: [10de:2204] type 00 class 0x030000 PCIe Legacy Endpoint
May 22 11:54:22 omen kernel: pci 0000:01:00.1: [10de:1aef] type 00 class 0x040300 PCIe Endpoint
May 22 11:54:22 omen kernel: pci 0000:02:00.0: [10ec:8168] type 00 class 0x020000 PCIe Endpoint
May 22 11:54:22 omen kernel: pcieport 0000:00:01.0: PME: Signaling with IRQ 121
May 22 11:54:22 omen kernel: pcieport 0000:00:1c.0: PME: Signaling with IRQ 122
May 22 11:54:22 omen kernel: ACPI: bus type drm_connector registered
May 22 11:54:22 omen kernel: [drm] Initialized simpledrm 1.0.0 20200625 for simple-framebuffer.0 on minor 0
May 22 11:54:22 omen kernel: simple-framebuffer simple-framebuffer.0: [drm] fb0: simpledrmdrmfb frame buffer device
May 22 11:54:22 omen kernel: pci 10000:e0:1d.4: [8086:7ab4] type 01 class 0x060400 PCIe Root Port
May 22 11:54:22 omen kernel: pci 10000:e1:00.0: [15b7:5011] type 00 class 0x010802 PCIe Endpoint
May 22 11:54:22 omen kernel: r8169 0000:02:00.0 eth0: RTL8168h/8111h, 6c:02:e0:50:bd:64, XID 541, IRQ 150
May 22 11:54:22 omen kernel: pcieport 10000:e0:1d.4: can't derive routing for PCI INT A
May 22 11:54:22 omen kernel: pcieport 10000:e0:1d.4: PCI INT A: no GSI
May 22 11:54:22 omen kernel: pcieport 10000:e0:1d.4: PME: Signaling with IRQ 151
May 22 11:54:22 omen kernel: pcieport 10000:e0:1d.4: AER: enabled with IRQ 151
May 22 11:54:22 omen kernel: pcieport 10000:e0:1d.4: can't derive routing for PCI INT A
May 22 11:54:22 omen systemd[1]: Starting modprobe@drm.service - Load Kernel Module drm...
May 22 11:54:22 omen systemd[1]: modprobe@drm.service: Deactivated successfully.
May 22 11:54:22 omen systemd[1]: Finished modprobe@drm.service - Load Kernel Module drm.
May 22 11:54:22 omen kernel: nvidia: loading out-of-tree module taints kernel.
May 22 11:54:22 omen kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 234
May 22 11:54:22 omen kernel: nvidia 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
May 22 11:54:22 omen kernel: NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  580.126.09  Release Build  (dvs-builder@U22-I3-AM02-24-3)  Wed Jan  7 22:51:36 UTC 2026
May 22 11:54:22 omen kernel: nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  580.126.09  Release Build  (dvs-builder@U22-I3-AM02-24-3)  Wed Jan  7 22:33:56 UTC 2026
May 22 11:54:22 omen kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
May 22 11:54:22 omen kernel: input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card2/input15
May 22 11:54:22 omen kernel: input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card2/input16
May 22 11:54:22 omen kernel: input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card2/input17
May 22 11:54:22 omen kernel: input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card2/input18
May 22 11:54:24 omen kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
May 22 11:54:24 omen kernel: nvidia 0000:01:00.0: vgaarb: deactivate vga console
May 22 11:54:24 omen kernel: fbcon: nvidia-drmdrmfb (fb0) is primary device
May 22 11:54:24 omen kernel: nvidia 0000:01:00.0: [drm] fb0: nvidia-drmdrmfb frame buffer device
May 24 01:03:49 omen kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
May 24 01:03:49 omen kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from rmStatus @ system_mem.c:356
May 24 01:03:49 omen kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from pRmApi->Alloc(pRmApi, device->session->handle, isSystemMemory ? device->handle : device->subhandle, &physHandle, isSystemMemory ? NV01_MEMORY_SYSTEM : NV01_MEMORY_LOCAL_USER, &memAllocParams, sizeof(memAllocParams)) @ nv_gpu_ops.c:4968
May 24 10:15:51 omen kernel: nvidia 0000:01:00.0: Using 39-bit DMA addresses
May 24 10:37:09 omen kernel: NVRM: GPU at PCI:0000:01:00: GPU-8558e31f-4cd3-7537-da51-8230e2f11c68
May 24 10:37:09 omen kernel: NVRM: GPU Board Serial Number: PKWUQ0C9VGJ1AZ
May 24 10:37:09 omen kernel: NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
May 24 10:37:09 omen kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
May 24 10:37:09 omen kernel: NVRM: GPU 0000:01:00.0: GPU serial number is PKWUQ0C9VGJ1AZ.
May 24 10:37:09 omen kernel: NVRM: kgspRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.
May 24 10:37:09 omen kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
May 24 10:37:09 omen kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
May 24 10:37:09 omen kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
May 24 10:37:09 omen kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
May 24 10:37:09 omen kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
May 24 10:37:09 omen kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
May 24 10:37:09 omen kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
May 24 10:37:09 omen kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
May 24 10:37:09 omen kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
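Most of the journal output is ordinary boot noise, and the line that matters is the Xid report. As a rough sketch, the Xid events can be pulled out with a regular expression; `find_xid_events` is a hypothetical helper, and the pattern is an assumption based only on the line format visible in this log.

```python
import re

# Matches lines such as:
#   NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
XID_LINE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+), (.*)")


def find_xid_events(lines):
    """Return one dict per Xid line: PCI address, Xid number, message."""
    events = []
    for line in lines:
        match = XID_LINE.search(line)
        if match:
            events.append(
                {
                    "pci": match.group(1),
                    "xid": int(match.group(2)),
                    "message": match.group(3).strip(),
                }
            )
    return events


# Lines copied from the journalctl output above.
sample_journal = [
    "May 24 10:37:09 omen kernel: NVRM: GPU Board Serial Number: PKWUQ0C9VGJ1AZ",
    "May 24 10:37:09 omen kernel: NVRM: Xid (PCI:0000:01:00): 79, "
    "GPU has fallen off the bus.",
]
for event in find_xid_events(sample_journal):
    print(event)
```

In practice the lines could come from `sys.stdin` with the `journalctl -k -b --no-pager` output piped in.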

Checking the PCI bus information.

lspci -nnk | grep -A4 -Ei 'nvidia|vga|3d|display'

0000:01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204] (rev a1)
	Subsystem: Hewlett-Packard Company GA102 [GeForce RTX 3090] [103c:88d5]
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
0000:01:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
	Subsystem: Hewlett-Packard Company GA102 High Definition Audio Controller [103c:88d5]
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel
0000:02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 16)

Tentative conclusion

The following lines were in the log, so I concluded that this is an Xid 79 problem.

NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.

  • In other words, the RTX 3090 itself dropped off the PCIe bus
  • The driver tried to access the GPU over PCI Express but could not reach it

Apparently the possible causes include:

  • A PCIe link failure
  • A GPU hardware failure
  • A driver problem
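One inexpensive check for the PCIe link hypothesis is to read the link attributes that the Linux kernel exposes in sysfs for each PCI device. The sketch below assumes the GPU sits at `0000:01:00.0`, as in the `lspci` output; `read_pcie_link` and `width_below_max` are hypothetical helper names. Only link width is compared, because GPUs commonly lower the link speed at idle to save power. A narrower width can also simply reflect the slot wiring, so treat the result as a hint rather than a diagnosis.

```python
from pathlib import Path

# Standard per-device PCIe link attributes in Linux sysfs.
LINK_ATTRS = (
    "current_link_speed",
    "max_link_speed",
    "current_link_width",
    "max_link_width",
)


def read_pcie_link(bdf="0000:01:00.0", sysfs_root="/sys/bus/pci/devices"):
    """Read the link attributes for one device; missing ones become None."""
    device_dir = Path(sysfs_root) / bdf
    info = {}
    for attr in LINK_ATTRS:
        try:
            info[attr] = (device_dir / attr).read_text().strip()
        except OSError:
            # The attribute (or the whole device) is not readable.
            info[attr] = None
    return info


def width_below_max(info):
    """True only if both widths are known and current is narrower than max."""
    current = info.get("current_link_width")
    maximum = info.get("max_link_width")
    if current is None or maximum is None:
        return False
    return int(current) < int(maximum)


link = read_pcie_link()
print(link, "width below max:", width_below_max(link))
```

Running this both before a training job and right after a blackout would show whether the device is still enumerated at all.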

Countermeasures

Capping the power limit

Current power settings:

nvidia-smi --query-gpu=name,power.draw,power.limit,power.default_limit,power.min_limit,power.max_limit --format=csv

name, power.draw [W], power.limit [W], power.default_limit [W], power.min_limit [W], power.max_limit [W]
NVIDIA GeForce RTX 3090, 33.32 W, 350.00 W, 350.00 W, 100.00 W, 350.00 W

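What each field means:

  • power.draw: current power draw
  • power.limit: the power limit currently in effect
  • power.default_limit: the default power limit
  • power.min_limit: the lowest value that can be set
  • power.max_limit: the highest value that can be set

For now, I lowered the limit so that the GPU cannot draw the full 350 W.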

sudo nvidia-smi -pl 280

Verify:

nvidia-smi --query-gpu=name,power.draw,power.limit,power.default_limit,power.min_limit,power.max_limit --format=csv

name, power.draw [W], power.limit [W], power.default_limit [W], power.min_limit [W], power.max_limit [W]
NVIDIA GeForce RTX 3090, 32.41 W, 280.00 W, 350.00 W, 100.00 W, 350.00 W
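To confirm the cap from a script rather than by eye, the CSV output can be parsed. This is a minimal sketch that assumes the exact `--format=csv` layout shown above (a header with unit suffixes such as `[W]` and values such as `280.00 W`); `parse_power_csv` is a hypothetical helper name.

```python
import csv
import io


def parse_power_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv` output into dicts.

    Header unit suffixes like " [W]" are dropped, and values ending in
    " W" are converted to floats.
    """
    reader = csv.reader(io.StringIO(text.strip()), skipinitialspace=True)
    header = [column.split(" [")[0] for column in next(reader)]
    rows = []
    for values in reader:
        if not values:
            continue  # skip blank lines
        row = {}
        for name, value in zip(header, values):
            row[name] = float(value[:-2]) if value.endswith(" W") else value
        rows.append(row)
    return rows


# Output copied from the verification step above.
sample_csv = """\
name, power.draw [W], power.limit [W], power.default_limit [W], power.min_limit [W], power.max_limit [W]
NVIDIA GeForce RTX 3090, 32.41 W, 280.00 W, 350.00 W, 100.00 W, 350.00 W
"""
gpu = parse_power_csv(sample_csv)[0]
cap_ratio = gpu["power.limit"] / gpu["power.default_limit"]
print(f"{gpu['name']}: limit {gpu['power.limit']} W ({cap_ratio:.0%} of default)")
```

With the values above, the 280 W cap is 80% of the 350 W default.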

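Switching the driver

  • I am currently on the NVIDIA UNIX Open Kernel Module 580.126.09
  • NVIDIA's open kernel modules target Turing and later GPUs and are apparently designed to depend on the GSP
  • Switching from nvidia-open to the proprietary nvidia driver is apparently a very reasonable option

Since this is Ubuntu, the plan is to switch to the regular package, that is, the one without the open suffix.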

ubuntu-drivers devices

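How to verify the countermeasures

  • I could not pin down the true cause, but I now understand the symptom
  • The only way forward is to eliminate hypotheses one by one, much like an ablation study in machine learning

Approaches:

  • Combinations
    • There are two countermeasures so far, so C(2,0) = 1 (the current state), C(2,1) = 2, and C(2,2) = 1, for 4 patterns in total
    • Excluding the baseline, run and verify the remaining three
  • Grid search
    • Power Limit: none / 250 W / 280 W
    • Driver: nvidia-open / proprietary
    • In other words, examine the settings at a finer granularity
  • Bisection
    • Hardware and software are different layers, so first split the problem along that boundary
    • This is a common idea when hunting bugs in code
    • The search finishes in O(log n) steps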
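The combination and grid-search plans above can be enumerated mechanically so that no configuration is forgotten. This is a small sketch using `itertools`; the labels `power_cap` and `proprietary_driver` are hypothetical names for the two countermeasures, and "none" is taken to mean the default limit with no cap applied.

```python
from itertools import combinations, product

# The two countermeasures discussed above (labels are illustrative).
countermeasures = ["power_cap", "proprietary_driver"]

# Every subset of the countermeasures: C(2,0) + C(2,1) + C(2,2) = 4 patterns.
patterns = [
    subset
    for size in range(len(countermeasures) + 1)
    for subset in combinations(countermeasures, size)
]
baseline, *to_test = patterns  # the empty subset is the current state

# Finer grid: every power limit paired with every driver.
power_limits = ["none", "250W", "280W"]
drivers = ["nvidia-open", "proprietary"]
grid = list(product(power_limits, drivers))

print("baseline:", baseline)
print("patterns to test:", to_test)
print("grid size:", len(grid))
for power_limit, driver in grid:
    print(f"  power_limit={power_limit}, driver={driver}")
```

The grid has 3 × 2 = 6 cells, and each cell needs several runs because the failure is intermittent, so the coarser 4-pattern plan is cheaper to start with.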

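Summary

  • It is not quite "be grateful for hardship", but trouble like this does deepen your knowledge
  • As a trial, I set a 280 W cap, and it has been stable so far (n=3)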
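It is worth checking how much n=3 actually tells us. The sketch below assumes the roughly 10% per-run failure rate from the Background section and treats runs as independent; both are assumptions, and the function names are hypothetical. It computes how likely three clean runs would be even if the cap had no effect, and how many consecutive clean runs would be needed before that explanation becomes unlikely.

```python
import math


def prob_all_clean(failure_rate, runs):
    """Probability that `runs` independent runs all succeed by chance."""
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate, alpha=0.05):
    """Smallest n such that n clean runs would occur with probability
    at most `alpha` if the failure rate were unchanged."""
    return math.ceil(math.log(alpha) / math.log(1.0 - failure_rate))


assumed_rate = 0.10  # the "roughly 10%" estimate from the Background section
print(f"P(3 clean runs with no real fix) = {prob_all_clean(assumed_rate, 3):.3f}")
print(f"clean runs needed for alpha=0.05: {runs_needed(assumed_rate)}")
```

Under these assumptions, three clean runs would happen about 73% of the time with no fix at all, and roughly 29 consecutive clean runs would be needed to reach the 5% level, so the 280 W result is encouraging but not yet conclusive.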
