Gpu has fallen off the bus ошибка - TopOshibok.ru - решение и исправление самых разных ошибок

I recently made a post because I couldn’t get my NVIDIA GPU up and running. This is the post: link to my other post. I got my gpu working now (through NVIDIA X server settings). These are my specs:

ubuntu version: 16.04.1

GPU: NVIDIA Corporation GM108M [GeForce 840M]

But every time I suspend my laptop and reboot it I get a black screen with this error message:

[ 5107.273042] usbhid 2-3:1.0: suspend error -5  
[ 5107.644336] NVRM: Xid (PCI:0000:03:00): 79, GPU has fallen off the bus.)  
[ 5107.644336]

The only solution there is is to completely reboot my laptop with the power button.

— Extra information —

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce 840M        Off  | 0000:03:00.0     Off |                  N/A |
| N/A   47C    P0    N/A /  N/A |    242MiB /  2002MiB |     24%      Default |
+-------------------------------+----------------------+----------------------+
                                                                             
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1409    G   /usr/lib/xorg/Xorg                             149MiB |
|    0      2471    G   compiz                                          92MiB |
|    0      2774    G   /usr/lib/firefox/firefox                         1MiB |
+-----------------------------------------------------------------------------+

Thus, it only happens when my laptop puts itself in sleep-/suspend-mode (I deactivated that now so it doesn’t go into sleep-/suspend-mode anymore). Powering up my laptop has no problems with this. Never encountered this before either. I also switched back to my Intel GPU to test if it still occurs, but it doesn’t. So it has to do something with my Nvidia GPU.

asked Jan 5, 2017 at 12:18

I had the exact same problem, I solved it by putting the graphics card into persistent mode:

$ sudo nvidia-smi -pm 1

I don’t know what this really does but it seems that it’s working for me.

I found the solution in the next forum: https://bbs.archlinux.org/viewtopic.php?id=145527

answered May 28, 2018 at 9:14

Update: There was a related bug on the ubuntu issue tracker that has since been fixed and released. Not sure if this answer is helpful anymore. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1847937

Here is a viable solution that doesn’t require you to limit your low power state usage:

The fix is to add the following arguments to the kernel boot parameters:
rcutree.rcu_idle_gp_delay=1 acpi_osi=! acpi_osi='Windows 2009'
You can test this fix by rebooting and pressing «e» on your primary boot entry in grub. Add the arguments to the end of the line ending with linux and press CTRL+X to boot. Try suspending and waking the system. If it works, you’re golden! To make the fix permanent you need to edit your \etc\default\grub file:

Open a terminal window and paste the following command: sudo xed /etc/default/grub

Enter your password. Then, find the line that starts with GRUB_CMDLINE_LINUX_DEFAULT=

Add the arguments to the end of this line, inside the quotes. So it looks roughly like this: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash rcutree.rcu_idle_gp_delay=1 acpi_osi=! acpi_osi='Windows 2009'"

Run sudo update-grub

Allegedly, the successful result can be achieved by replacing Windows 2009 with Linux to clarify to BIOS what OS we are using. This did not work for me, but others commented below that it helped them. In my case, I left it as Windows 2009.

Source: https://forums.linuxmint.com/viewtopic.php?p=1728952&sid=d2f654dfa1082400eeea98c9fbf01918#p1728952

answered May 7, 2020 at 19:24

Next time, try to log in via SSH to halt/reboot your computer.
Other procedure would be to press magic+r to unbind keyboard from X and pressing cltr+alt+del.

I have the same problem with this version of driver.
Try the ubuntu driver package!

answered Jan 8, 2017 at 20:27

Tried everything.
Only one thing helped: disable the ASPM.
Add this to kernel boot arguments: pcie_aspm=off

answered Jan 31, 2020 at 15:03

znd0znd0

393 bronze badges

Having the same issue on Ubuntu 18.04, I’m using nvidia-prime for graphics switching with nvidia-driver-396(.24) installed. This issue only occurs when running on the dedicated card using:

sudo prime-select nvidia

On recovering from suspend, the desktop flashes up then black screens as mentioned above with the very same error message.

Hardware (Dell inspiron 7559):
Nvidia GTX 960m
Intel i7-6700HQ

Word Around:

A fix that worked for me, was to delete the default swapfile made during install and create a dedicated swap partition, of course remember to add to fstab and direct grub to the partition with resume=»UUID».

answered Jul 13, 2018 at 12:55

None of the solutions here and anywhere else helped for me. There was, though, one thing I noticed.

I was getting the PCIe bus error severity=corrected messages (lots of them) in dmesg. This most likely caused the NVIDIA driver to crash. So, I found more information about this error, and found this article. My solution was to edit /etc/default/grub and add pci=nomsi to kernel arguments (by editing this line):

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=nomsi"

Save the file, and run sudo update-grub. Then, reboot and the NVIDIA driver shouldn’t have any issues.

I haven’t tested everything with this solution, but the NVIDIA driver finally started on boot, and even suspend was working. I’ll edit this post if I experience any issues again.

Edit: This didn’t help… it just made the drivers working for one boot, then I experienced the issue again. I asked for this as another question: GPU has fallen off the bus (on almost every boot)

answered Jan 2, 2021 at 21:39

adazem009adazem009

9823 gold badges14 silver badges26 bronze badges

I found out that Ubuntu 18.04 gives an option of using Ubuntu on Wayland, which I learnt is an alternative to the x server.
I logged in using the Ubuntu on Wayland option on the login screen:

Now I can use the suspend option without any problem.

answered Dec 27, 2018 at 19:38

I had a similar issue on Linux Mint.
Most forum posts suggested that this is something to do with Active State Power Management (ASPM) on Linux that doesn’t play well with the NVidia driver. They suggested that you turn off ASPM in the boot options. However that doesn’t work if you have ASPM off in your bios. I finally got it to work by turning on ASPM in bios but turning it off in the boot options — so Linux knows that its controlling ASPM and not bios.

answered Jun 19, 2019 at 4:06

You must log in to answer this question.

Not the answer you’re looking for? Browse other questions tagged

.

Источник

Предлагаю тему для разбора и решения возникающих проблем

Опишу свою ситуацию. Риг работал 1,5 месяца без особых проблем и на высокой скорости, конфиг не менялся, но начала отваливаться карта.
Поменял райзер — проблема осталась
менял блоки питания — проблема осталась
менял местами в PCI-E слотах — отваливается
поменял карту на такую-же — проблема с отвалом осталась
поменял карту на такую-же, но другого производителя — проблема осталась.
Тепер конфиг мой

Дрова
NVIDIA-SMI 384.111 Driver Version: 384.111
Мать
Asus B250 MINING EXPERT, BIOS 0401 08/30/2017
Система
Linux w0 4.10.0-42-generic #46-Ubuntu SMP Mon Dec 4 14:38:01 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

журнал ошибки майнера EBFW 0.3.4b

Код:

INFO: Detected new work: 64d1
ERROR: Looks like GPU3 are stopped. Restart attempt.
ERROR: Looks like GPU3 are stuck he not respond.
ERROR: Looks like GPU3 are stopped. Restart attempt.
CUDA: Device: 5 Thread exited with code: 6
CUDA: Device: 1 Thread exited with code: 6
CUDA: Device: 2 Thread exited with code: 6
CUDA: Device: 9 Thread exited with code: 6
etc..

00:58:02 worker1 kernel: [ 2269.319832] NVRM: GPU at PCI:0000:0a:00: GPU-e3cebabe-58e8-f16d-cc76-bacc069cd24a
Feb 1 00:58:02 worker1 kernel: [ 2269.319838] NVRM: GPU Board Serial Number:
Feb 1 00:58:02 worker1 kernel: [ 2269.319840] NVRM: Xid (PCI:0000:0a:00): 8, Channel 00000018
Feb 1 00:58:03 worker1 kernel: [ 2270.327462] NVRM: Xid (PCI:0000:0a:00): 38, 0001 0000902d 00000000 00000000 00000000 00000000
Feb 1 00:58:03 worker1 kernel: [ 2270.329992] NVRM: Xid (PCI:0000:0a:00): 38, 0001 0000902d 00000000 00000000 00000000 00000000
Feb 1 00:58:03 worker1 kernel: [ 2270.383637] NVRM: Xid (PCI:0000:0a:00): 38, 0001 0000902d 00000000 00000000 00000000 00000000
Feb 1 00:58:04 worker1 kernel: [ 2271.821908] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 1 00:58:06 worker1 kernel: [ 2273.821986] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 1 00:58:07 worker1 kernel: [ 2274.042257] show_signal_msg: 26 callbacks suppressed
Feb 1 00:58:07 worker1 kernel: [ 2274.042258] miner[3063]: segfault at 7fea94ffd9d0 ip 00007feaa6117829 sp 00007fff4e6353f8 error 4 in libpthread-2.24.so[7feaa6109000+18000]
Feb 1 00:58:08 worker1 kernel: [ 2275.822042] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 1 00:58:10 worker1 kernel: [ 2277.822097] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 1 00:58:12 worker1 kernel: [ 2279.822156] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 1 00:58:14 worker1 kernel: [ 2281.822214] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 1 00:58:16 worker1 kernel: [ 2283.822269] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 1 00:58:18 worker1 kernel: [ 2285.822324] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 1 00:58:20 worker1 kernel: [ 2287.822372] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 1 00:58:22 worker1 kernel: [ 2289.822421] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 1 00:58:24 worker1 kernel: [ 2291.822480] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 1 00:58:26 worker1 kernel: [ 2293.822538] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 1 00:58:28 worker1 kernel: [ 2295.822586] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 1 00:58:29 worker1 kernel: [ 2296.124241] NMI watchdog: BUG: soft lockup — CPU#1 stuck for 22s! [swapper/1:0]
Feb 1 00:58:29 worker1 kernel: [ 2296.124242] Modules linked in: bnep nls_iso8859_1 eeepc_wmi asus_wmi sparse_keymap mxm_wmi intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm nvidia_uvm(POE) irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd input_leds serio_raw snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd soundcore mei_me mei shpchp hci_uart btbcm btqca btintel bluetooth wmi intel_lpss_acpi intel_lpss mac_hid acpi_als acpi_pad kfifo_buf industrialio parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic usbhid e1000e ptp pps_core nvidia_drm(POE) nvidia_modeset(POE) i915 nvidia(POE) i2c_algo_bit
Feb 1 00:58:29 worker1 kernel: [ 2296.124264] drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci drm libahci video i2c_hid hid fjes
Feb 1 00:58:29 worker1 kernel: [ 2296.124268] CPU: 1 PID: 0 Comm: swapper/1 Tainted: P OE 4.10.0-42-generic #46-Ubuntu
Feb 1 00:58:29 worker1 kernel: [ 2296.124268] Hardware name: System manufacturer System Product Name/B250 MINING EXPERT, BIOS 0401 08/30/2017
Feb 1 00:58:29 worker1 kernel: [ 2296.124269] task: ffff97d4ed2dc500 task.stack: ffffb18100cf4000
Feb 1 00:58:29 worker1 kernel: [ 2296.124370] RIP: 0010:_nv028492rm+0x13/0x30 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.124370] RSP: 0018:ffff97d4f6c83b20 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
Feb 1 00:58:29 worker1 kernel: [ 2296.124371] RAX: 0000000000000c20 RBX: 0000000010011001 RCX: 0000000000000c20
Feb 1 00:58:29 worker1 kernel: [ 2296.124372] RDX: ffffb18115000000 RSI: ffff97d4e5ef8008 RDI: ffff97d4e31c4008
Feb 1 00:58:29 worker1 kernel: [ 2296.124372] RBP: ffff97d4dc5cdb58 R08: ffff97d4e2651b88 R09: ffff97d4dc5cdb68
Feb 1 00:58:29 worker1 kernel: [ 2296.124372] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffc05fc18d
Feb 1 00:58:29 worker1 kernel: [ 2296.124373] R13: 0000000000000000 R14: ffff97d4e5ef8df8 R15: 0000000000000000
Feb 1 00:58:29 worker1 kernel: [ 2296.124373] FS: 0000000000000000(0000) GS:ffff97d4f6c80000(0000) knlGS:0000000000000000
Feb 1 00:58:29 worker1 kernel: [ 2296.124374] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 1 00:58:29 worker1 kernel: [ 2296.124374] CR2: 00000000004b39ac CR3: 000000003e209000 CR4: 00000000003406e0
Feb 1 00:58:29 worker1 kernel: [ 2296.124375] Call Trace:
Feb 1 00:58:29 worker1 kernel: [ 2296.124375] <IRQ>
Feb 1 00:58:29 worker1 kernel: [ 2296.124507] ? _nv019935rm+0x153/0x190 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.124623] ? _nv017489rm+0x57/0x1a0 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.124735] ? _nv017451rm+0x144/0x1c0 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.124848] ? _nv017451rm+0xcd/0x1c0 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.124971] ? _nv021596rm+0x114/0x2a0 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.125093] ? _nv021596rm+0xf2/0x2a0 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.125199] ? _nv015872rm+0xcf/0x320 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.125303] ? _nv016307rm+0x485/0x4b0 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.125425] ? _nv021236rm+0xa59/0xf10 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.125550] ? _nv021046rm+0x1c3/0x320 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.125673] ? _nv031243rm+0x233/0x300 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.125796] ? _nv031190rm+0x20a/0x3f0 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.125918] ? _nv031190rm+0x1d9/0x3f0 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.126040] ? _nv031191rm+0x6a5/0x830 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.126162] ? _nv031275rm+0xd5/0xf0 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.126284] ? _nv031277rm+0x417/0x590 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.126405] ? _nv031276rm+0x51/0x160 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.126497] ? _nv028731rm+0x192/0x1a0 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.126547] ? nv_numa_offline_memory+0x10/0x10 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.126632] ? rm_run_rc_callback+0x91/0xf0 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.126682] ? nvidia_rc_timer_callback+0x53/0x90 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.126731] ? nv_timer_callback_anon_data+0x7/0x10 [nvidia]
Feb 1 00:58:29 worker1 kernel: [ 2296.126733] ? call_timer_fn+0x35/0x140
Feb 1 00:58:29 worker1 kernel: [ 2296.126734] ? run_timer_softirq+0x215/0x4b0
Feb 1 00:58:29 worker1 kernel: [ 2296.126735] ? ktime_get+0x41/0xb0
Feb 1 00:58:29 worker1 kernel: [ 2296.126736] ? lapic_next_deadline+0x26/0x30
Feb 1 00:58:29 worker1 kernel: [ 2296.126737] ? clockevents_program_event+0x7f/0x120
Feb 1 00:58:29 worker1 kernel: [ 2296.126738] ? __do_softirq+0x104/0x2af
Feb 1 00:58:29 worker1 kernel: [ 2296.126740] ? irq_exit+0xb6/0xc0
Feb 1 00:58:29 worker1 kernel: [ 2296.126740] ? smp_apic_timer_interrupt+0x3d/0x50
Feb 1 00:58:29 worker1 kernel: [ 2296.126742] ? apic_timer_interrupt+0x89/0x90
Feb 1 00:58:29 worker1 kernel: [ 2296.126742] </IRQ>
Feb 1 00:58:29 worker1 kernel: [ 2296.126744] ? cpuidle_enter_state+0x122/0x2c0
Feb 1 00:58:29 worker1 kernel: [ 2296.126745] ? cpuidle_enter_state+0x110/0x2c0
Feb 1 00:58:29 worker1 kernel: [ 2296.126745] ? cpuidle_enter+0x17/0x20
Feb 1 00:58:29 worker1 kernel: [ 2296.126747] ? call_cpuidle+0x23/0x40
Feb 1 00:58:29 worker1 kernel: [ 2296.126747] ? do_idle+0x189/0x200
Feb 1 00:58:29 worker1 kernel: [ 2296.126748] ? cpu_startup_entry+0x71/0x80
Feb 1 00:58:29 worker1 kernel: [ 2296.126749] ? start_secondary+0x154/0x190
Feb 1 00:58:29 worker1 kernel: [ 2296.126750] ? start_cpu+0x14/0x14
Feb 1 00:58:29 worker1 kernel: [ 2296.126751] Code: 31 ff e8 71 13 00 00 48 89 c7 e8 59 c2 f9 ff 0f b7 c3 5b c3 0f 1f 40 00 53 31 db 39 4a 10 76 0f 48 8b 12 c1 e9 02 89 c8 8b 1c 82 <89> d8 5b c3 31 ff e8 42 13 00 00 48 89 c7 e8 2a c2 f9 ff 89 d8
Feb 1 00:58:30 worker1 kernel: [ 2297.822648] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 1 00:58:33 worker1 systemd[1837]: Starting Notification regarding a crash report…
Feb 1 00:58:33 worker1 systemd[1837]: Started Notification regarding a crash report.
Feb 1 00:58:41 worker1 kernel: [ 2308.583244] NVRM: Xid (PCI:0000:0a:00): 79, GPU has fallen off the bus.
Feb 1 00:58:41 worker1 kernel: [ 2308.583246] NVRM: GPU at 0000:0a:00.0 has fallen off the bus.
Feb 1 00:58:41 worker1 kernel: [ 2308.583248] NVRM: GPU is on Board .
Feb 1 00:58:41 worker1 kernel: [ 2308.633801] NVRM: A GPU crash dump has been created. If possible, please run
Feb 1 00:58:41 worker1 kernel: [ 2308.633801] NVRM: nvidia-bug-report.sh as root to collect this data before
Feb 1 00:58:41 worker1 kernel: [ 2308.633801] NVRM: the NVIDIA kernel module is unloaded.
Feb 1 00:58:42 worker1 fans.sh.desktop[2572]: ERROR: Error assigning value 100 to attribute ‘GPUTargetFanSpeed’
Feb 1 00:58:42 worker1 fans.sh.desktop[2572]: (worker1:0[fan:3]) as specified in assignment
Feb 1 00:58:42 worker1 fans.sh.desktop[2572]: ‘[fan:3]/GPUTargetFanSpeed=100’ (Unknown Error).
Feb 1 00:58:48 worker1 fans.sh.desktop[2572]: ERROR: Error assigning value 100 to attribute ‘GPUTargetFanSpeed’
Feb 1 00:58:48 worker1 fans.sh.desktop[2572]: (worker1:0[fan:3]) as specified in assignment
Feb 1 00:58:48 worker1 fans.sh.desktop[2572]: ‘[fan:3]/GPUTargetFanSpeed=100’ (Unknown Error).
Feb 1 00:58:53 worker1 systemd[1]: Reloaded OpenBSD Secure Shell server.
Feb 1 00:58:53 worker1 systemd[1]: Reloading.
Feb 1 00:58:53 worker1 systemd[1]: Reloading OpenBSD Secure Shell server.
Feb 1 00:58:53 worker1 systemd[1]: Reloaded OpenBSD Secure Shell server.
Feb 1 00:58:53 worker1 systemd[1]: Reloading.
Feb 1 00:58:54 worker1 fans.sh.desktop[2572]: ERROR: Error assigning value 100 to attribute ‘GPUTargetFanSpeed’
Feb 1 00:58:54 worker1 fans.sh.desktop[2572]: (worker1:0[fan:3]) as specified in assignment
Feb 1 00:58:54 worker1 fans.sh.desktop[2572]: ‘[fan:3]/GPUTargetFanSpeed=100’ (Unknown Error).
Feb 1 00:59:00 worker1 fans.sh.desktop[2572]: ERROR: Error assigning value 100 to attribute ‘GPUTargetFanSpeed’
Feb 1 00:59:00 worker1 fans.sh.desktop[2572]: (worker1:0[fan:3]) as specified in assignment
Feb 1 00:59:00 worker1 fans.sh.desktop[2572]: ‘[fan:3]/GPUTargetFanSpeed=100’ (Unknown Error).
Feb 1 00:59:05 worker1 fans.sh.desktop[2572]: ERROR: Error assigning value 100 to attribute ‘GPUTargetFanSpeed’
Feb 1 00:59:05 worker1 fans.sh.desktop[2572]: (worker1:0[fan:3]) as specified in assignment
…

Ошибка возникает всегда по одному и тому же адресу в не зависимости от: райзера/карты/питания/номера слота на маме.
Пока не решил и думаю – может посоветуете чего?

Последнее редактирование: 2 Фев 2018

а где логика? каков механизм «выздоровления»? и месяц оно работало без проблем вообще

а где логика? каков механизм «выздоровления»? и месяц оно работало без проблем вообще

ну как я понял ты все поменял, кроме самого линукса — значит дело в нем. А вообще, у меня было что я видяху разобрал, потекший силикон спиртом протер, собрал и ферма перестала виснуть.

Это единственное, что ты ещё не менял

У меня было такое 1 в 1. Пришлось вытаскивать все карты, включать по одной и тестировать в работе. В процессе выяснилось что навернулось:
— 1 переходник PCI-E — USB (который идет в комплекте с райзером)
— 1 переходник m.2

заменил — проблема исчезла.

У меня было такое 1 в 1. Пришлось вытаскивать все карты, включать по одной и тестировать в работе. В процессе выяснилось что навернулось:
— 1 переходник PCI-E — USB (который идет в комплекте с райзером)
— 1 переходник m.2

заменил — проблема исчезла.

дело в том, что я менял райзер комплектом с переходником и проводом, но все-равно спасибо

Выставил PCI-e в поколение 2?
Биос свежий на мамку шил?
Возможно проблема в мамке, я тоже боролся с отвалом карты, почему то последняя отваливалась. Заменил мамку на аналогичную, работает нормально.

Выставил PCI-e в поколение 2?
Биос свежий на мамку шил?
Возможно проблема в мамке, я тоже боролся с отвалом карты, почему то последняя отваливалась. Заменил мамку на аналогичную, работает нормально.

Биос от августа (см.выше BIOS 0401 08/30/2017) новее утилитой не доступен. Скорость шины gen2 поставил сразу же. Смена операционки на форточки дала снижение хэшрэйта и иные ошибки, но по прежнему адресу GPU 3.
Кстати! если на убунте это работало по 14-18 часов и лишь потом падало в ошибку, то в форточках я даже не подошел к своей производительности — ошибка выпадала сразу и форточки повисали.
Пойду обратно в .nix

Отпишусь о результатах.
Поставил Debian 9 xfce, драйвер поставил самый последний Driver Version: 390.25 скорость 7350-7400 солей. Работает стабильно и не зависает, блок термалтейк 1500 совсем не держит заявленых цифр ну и контакты подгорают, пришлось поджимать некоторые. в целом проблема решена, вот только кто-нить может сказать что мне ядро в syslog вывалило??? чтоб дальше понимать как дебажить проблемы?

см 1 сообщение, спойлер «что упало в syslog»

см 1 сообщение, спойлер «что упало в syslog»

NVRM: Xid (PCI:0000:0a:00): 79, GPU has fallen off the bus.

вышибло карту по адресу : PCI:0000:0a:00

если говоришь, что возникает в не зависимости от: райзера/карты/питания/номера слота на маме,

то похоже, что питания не хватает.

попробуй Ethash VS Equihash

есть на equihash валится будет, то питание 100%

NVRM: Xid (PCI:0000:0a:00): 79, GPU has fallen off the bus.

вышибло карту по адресу : PCI:0000:0a:00

Это я и сам прочитал, а вот про питание спасибо за подсказку, но если не трудно дай ссылку где почитать про алгоритмы и то как железо себя ведет при бруте разных алгоритмов.
Валится как раз на equihash

ну, по работам алгоритмов и сам погуглишь…

на equihash жор намного возрастает, задействуется CPU на всю мощь.

у меня уже просто опыт, 99% со временем питалово снижается от БП и начинаются глюки…

Ребят подскажите. Ферма на Linux . После конфигурации иксов, картинка (рабочий стол) начинает выводится на другую карту, т.е. при загрузке рига усе загружается до черного экрана, перетыкаю моник в другую карту там рабочий стол. проще линукс смещает рабочий стол как будто 2 монитора. На функционал не влияет но мне ненравится.
И еще на одном риге минт стоит, и когда захожу на нее через тим вивер зависает инет на этом риге. Зависает не сразу а через минуту примерно.

Ребят подскажите. Ферма на Linux . После конфигурации иксов, картинка (рабочий стол) начинает выводится на другую карту, т.е. при загрузке рига усе загружается до черного экрана, перетыкаю моник в другую карту там рабочий стол. проще линукс смещает рабочий стол как будто 2 монитора. На функционал не влияет но мне ненравится.
И еще на одном риге минт стоит, и когда захожу на нее через тим вивер зависает инет на этом риге. Зависает не сразу а через минуту примерно.

Linux считывает аппаратный адрес шины PCI-E и выводит картинку на первый или на наименьший адрес. У меня с адресацией на платах такие проблемы были.

Тимвьювер без настройки sudoers и прав может вообще насмерть вешать систему, а не только сеть.

Это я и сам прочитал, а вот про питание спасибо за подсказку, но если не трудно дай ссылку где почитать про алгоритмы и то как железо себя ведет при бруте разных алгоритмов.
Валится как раз на equihash

Что-то я не догнал в чем соль ошибки «GPU has fallen off the bus.» Диспетчер краша не указывает какая именно карта, да и тем более питания на весь риг с лихвой, аж с двух БП записаны карточки, но карта(ы) вылетают на пару минут раз в 1-2 неделю стабильно.

Что-то я не догнал в чем соль ошибки «GPU has fallen off the bus.» Диспетчер краша не указывает какая именно карта, да и тем более питания на весь риг с лихвой, аж с двух БП записаны карточки, но карта(ы) вылетают на пару минут раз в 1-2 неделю стабильно.

Посмотрите /var/log/syslog там указан какой PCI-E слот выпал, у меня типо такого было

Bash:

Jul 20 13:13:44 kernel: [ 4009.915597] NVRM: Xid (PCI:0000:04:00): 79, pid=0, GPU has fallen off the bus.
Jul 20 13:13:44 kernel: [ 4009.915597] NVRM: GPU 0000:04:00.0: GPU has fallen off the bus.
Jul 20 13:13:44 kernel: [ 4009.915597] NVRM: GPU 0000:04:00.0: GPU is on Board .
Jul 20 13:13:44 kernel: [ 4010.471591] NVRM: A GPU crash dump has been created. If possible, please run
Jul 20 13:13:44 kernel: [ 4010.471591] NVRM: nvidia-bug-report.sh as root to collect this data before
Jul 20 13:13:44 kernel: [ 4010.471591] NVRM: the NVIDIA kernel module is unloaded.

Источник

I recently made a post because I couldn’t get my NVIDIA GPU up and running. This is the post: link to my other post. I got my gpu working now (through NVIDIA X server settings). These are my specs:

ubuntu version: 16.04.1

GPU: NVIDIA Corporation GM108M [GeForce 840M]

But every time I suspend my laptop and reboot it I get a black screen with this error message:

[ 5107.273042] usbhid 2-3:1.0: suspend error -5  
[ 5107.644336] NVRM: Xid (PCI:0000:03:00): 79, GPU has fallen off the bus.)  
[ 5107.644336]

The only solution there is is to completely reboot my laptop with the power button.

— Extra information —

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce 840M        Off  | 0000:03:00.0     Off |                  N/A |
| N/A   47C    P0    N/A /  N/A |    242MiB /  2002MiB |     24%      Default |
+-------------------------------+----------------------+----------------------+
                                                                             
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1409    G   /usr/lib/xorg/Xorg                             149MiB |
|    0      2471    G   compiz                                          92MiB |
|    0      2774    G   /usr/lib/firefox/firefox                         1MiB |
+-----------------------------------------------------------------------------+

asked Jan 5, 2017 at 12:18

I had the exact same problem, I solved it by putting the graphics card into persistent mode:

$ sudo nvidia-smi -pm 1

I don’t know what this really does but it seems that it’s working for me.

I found the solution in the next forum: https://bbs.archlinux.org/viewtopic.php?id=145527

answered May 28, 2018 at 9:14

Here is a viable solution that doesn’t require you to limit your low power state usage:

The fix is to add the following arguments to the kernel boot parameters:
rcutree.rcu_idle_gp_delay=1 acpi_osi=! acpi_osi='Windows 2009'
You can test this fix by rebooting and pressing «e» on your primary boot entry in grub. Add the arguments to the end of the line ending with linux and press CTRL+X to boot. Try suspending and waking the system. If it works, you’re golden! To make the fix permanent you need to edit your \etc\default\grub file:

Open a terminal window and paste the following command: sudo xed /etc/default/grub

Enter your password. Then, find the line that starts with GRUB_CMDLINE_LINUX_DEFAULT=

Add the arguments to the end of this line, inside the quotes. So it looks roughly like this: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash rcutree.rcu_idle_gp_delay=1 acpi_osi=! acpi_osi='Windows 2009'"

Run sudo update-grub

Source: https://forums.linuxmint.com/viewtopic.php?p=1728952&sid=d2f654dfa1082400eeea98c9fbf01918#p1728952

answered May 7, 2020 at 19:24

Next time, try to log in via SSH to halt/reboot your computer.
Other procedure would be to press magic+r to unbind keyboard from X and pressing cltr+alt+del.

I have the same problem with this version of driver.
Try the ubuntu driver package!

answered Jan 8, 2017 at 20:27

Tried everything.
Only one thing helped: disable the ASPM.
Add this to kernel boot arguments: pcie_aspm=off

answered Jan 31, 2020 at 15:03

znd0znd0

393 bronze badges

Having the same issue on Ubuntu 18.04, I’m using nvidia-prime for graphics switching with nvidia-driver-396(.24) installed. This issue only occurs when running on the dedicated card using:

sudo prime-select nvidia

On recovering from suspend, the desktop flashes up then black screens as mentioned above with the very same error message.

Hardware (Dell inspiron 7559):
Nvidia GTX 960m
Intel i7-6700HQ

Word Around:

answered Jul 13, 2018 at 12:55

None of the solutions here and anywhere else helped for me. There was, though, one thing I noticed.

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=nomsi"

Save the file, and run sudo update-grub. Then, reboot and the NVIDIA driver shouldn’t have any issues.

I haven’t tested everything with this solution, but the NVIDIA driver finally started on boot, and even suspend was working. I’ll edit this post if I experience any issues again.

answered Jan 2, 2021 at 21:39

adazem009adazem009

9823 gold badges14 silver badges26 bronze badges

I found out that Ubuntu 18.04 gives an option of using Ubuntu on Wayland, which I learnt is an alternative to the x server.
I logged in using the Ubuntu on Wayland option on the login screen:

Now I can use the suspend option without any problem.

answered Dec 27, 2018 at 19:38

answered Jun 19, 2019 at 4:06

You must log in to answer this question.

Not the answer you’re looking for? Browse other questions tagged

.

Источник

I recently purchased a new GPU card for my server, but have not been successful in getting the nvidia drivers to work with it — either as a graphics card, or as a cuda compute engine. The card in question is a Nvidia GeForce GT 710. (https://www.zotac.com/us/product/graphi … b-pcie-x-1) (The motherboard I’m using is an Asus PRIME B250M-A, if that matters.)

I have all the needed packages installed:

[darose@darsys12 ~]$ pacman -Q | grep nvidia
nvidia-dkms 465.31-1
nvidia-settings 465.31-1
nvidia-utils 465.31-1
opencl-nvidia 465.31-1

The machine definitely *sees* the card:

[darose@darsys12 ~]$ lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GK208B [GeForce GT 710] (rev a1)

But no matter what I try, I keep getting «GPU has fallen off the bus» errors:

[darose@darsys12 ~]$ dmesg | grep -i -e nvidia -e nvrm
[    0.000000] Command line: BOOT_IMAGE=../vmlinuz-linux root=LABEL=root rw nvidia-drm.modeset=1 initrd=../intel-ucode.img,../initramfs-linux.img
[    0.046539] Kernel command line: BOOT_IMAGE=../vmlinuz-linux root=LABEL=root rw nvidia-drm.modeset=1 initrd=../intel-ucode.img,../initramfs-linux.img
[    1.099115] nvidia: loading out-of-tree module taints kernel.
[    1.099123] nvidia: module license 'NVIDIA' taints kernel.
[    1.109036] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    1.132441] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[    1.133265] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    1.249990] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  465.31  Thu May 13 22:24:36 UTC 2021
[    1.253004] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  465.31  Thu May 13 22:14:23 UTC 2021
[    1.253833] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[    1.262431] nvidia-uvm: Loaded the UVM driver, major device number 237.
[    1.284546] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    1.692351] NVRM: GPU at PCI:0000:01:00: GPU-94909028-e642-5b1e-39c9-b3c510991de2
[    1.692354] NVRM: Xid (PCI:0000:01:00): 79, pid=140, GPU has fallen off the bus.
[    1.692355] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[    1.696537] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0xf:1242)
[    1.696593] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    1.696712] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[    1.696916] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
...

And the nvidia drivers can’t access it:

[darose@darsys12 ~]$ sudo nvidia-smi 
No devices were found
[darose@darsys12 ~]$ sudo clinfo 
Number of platforms                               0

That «GPU has fallen off the bus» is a pretty vague error message, and there’s not a ton of good suggestions on the Net about how to fix this. Several pages suggested either that the card wasn’t seated correctly, or it’s a hardware error. But I tried re-seating the card again (as well as installing it in a different slot) but that made no difference. As far as a hardware error, that seems unlikely too, as a) this is a brand new card, and b) it works *perfectly* when I use the nouveau driver instead (though without any compute capabilities, obviously).

I’ve tried upgrading and downgrading kernels, upgrading and downgrading the nvidia packages, booting using «rcutree.rcu_idle_gp_delay=1 pcie_aspm=off» and numerous other things, but nothing seems to fix the issue.

Yes, I can get the graphics working on the card if I need to (using nouveau), but that doesn’t really help me. This is a headless server, and so doesn’t really make any use of graphics capabilities, and I bought the card in order to use it for compute purposes.

Been wrestling with this for several days now, and I’m pretty frustrated and at my wits end on this.

Any suggestions on how to fix or debug this further would be most welcome!

Last edited by darose (2021-07-13 22:07:31)

Источник

NVIDIA Developer Forums

Источник