Miner throws "CUDA error: out of memory" - TopOshibok.ru - fixes and solutions for all kinds of errors

Rig: 7 × 1660 Ti
NBMiner 42.2 (same on version 39.5)
Windows 10
8 GB of RAM.
Pagefile is 60 GB. Increasing it doesn't help!!!
The miner starts and throws an out-of-memory error. I'm attaching screenshots of the error and of the pagefile settings.
I've updated the miner and switched to PhoenixMiner 6.2; still the same error.
Please help me beat this thing. The rig is at a remote location.

P.S. I did use the search. Everyone only talks about the pagefile.


Same problem on a 1660 Super.
I've been trying different options for an hour already, nothing helps…

The card that drives the display can no longer fit the DAG file into its memory.

"The card that drives the display can no longer fit the DAG file into its memory."

Nope, the display runs off the integrated graphics.

Phoenix crashed for me an hour ago too, switched to GMiner.

Then no idea. My 1660 Ti works, but the display runs off an 8 GB card. )

"Phoenix crashed for me an hour ago too, switched to GMiner."

I had the latest GMiner, 3.01, and that's where the errors started.

The rig is at a remote location. There's a dummy plug in the card. I connect via AnyDesk…

Also caught "Cuda Error: out of memory" an hour ago; it started working on NBMiner 40.1.

The DAG crept up unnoticed, even though you could see it coming from miles away.

It's too early for 6 GB cards to drop out,
that should be in 2023 or so.

On a 3060 the old Phoenix dropped out too. Installed PhoenixMiner 6.2c and it works.

"It's too early for 6 GB cards to drop out,
that should be in 2023 or so."

Don't forget about the OS.

"I had the latest GMiner, 3.01, and that's where the errors started."

Some kind of nonsense. My rig on HiveOS was running stably, six 570 8 GB cards, then the latest Phoenix started complaining about the DAG. Tried NBMiner and GMiner, all fine.

On Windows mine dropped out too (PhoenixMiner_5.5c_Windows); updated to (PhoenixMiner_5.6d_Windows) and everything works.

I'll try updating the video driver. Currently on 425.31. Downloading 472…

Noticed it too: the DAG shows 4.9, but the 1660 Super started losing hashrate; the 8 GB cards have only 1.64 GB free, and the 12 GB card has only 5 GB free with 7 in use, with a 4.9 DAG, mind you. Running the latest T-Rex. Calling in the explanation squad!

"On Windows mine dropped out too (PhoenixMiner_5.5c_Windows); updated to (PhoenixMiner_5.6d_Windows) and everything works."

Why not the current version, 6.2? If you're updating, update to the latest.

Are you all dense? Phoenix spelled it all out:

2022.06.26:23:58:49.823: GPU1 GPU1: Allocating DAG for epoch #501 (4.91) GB
2022.06.26:23:58:49.828: GPU1 GPU1: Generating DAG for epoch #501
2022.06.26:23:58:49.828: GPU1 GPU1: Unable to generate DAG for epoch #501; please upgrade to the latest version of PhoenixMiner
2022.06.26:23:58:49.828: GPU1 GPU1 initMiner error: Unable to initialize CUDA miner
2022.06.26:23:58:49.828: wdog Fatal error detected. Restarting.
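
The log reports a 4.91 GB DAG for epoch #501. As a rough cross-check, the sketch below parses such a line and estimates the DAG size from the epoch number. It assumes the constants from the public Ethash specification (1 GiB initial dataset, 8 MiB growth per epoch) and skips the specification's small downward adjustment to a prime row count, so the result is a slight overestimate. The function names are made up for this example.

```python
import re

# Ethash dataset constants (assumed from the public Ethash specification).
DATASET_BYTES_INIT = 2**30    # 1 GiB at epoch 0
DATASET_BYTES_GROWTH = 2**23  # +8 MiB per epoch

DAG_LINE = re.compile(r"GPU(\d+): Allocating DAG for epoch #(\d+) \(([\d.]+)\) GB")


def approx_dag_gib(epoch: int) -> float:
    """Approximate DAG size in GiB (ignores the prime-size adjustment)."""
    return (DATASET_BYTES_INIT + DATASET_BYTES_GROWTH * epoch) / 2**30


def parse_dag_line(line: str):
    """Return (gpu_index, epoch, reported_gb), or None if the line does not match."""
    m = DAG_LINE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), float(m.group(3))


def first_epoch_exceeding(vram_gib: float) -> int:
    """First epoch whose approximate DAG alone no longer fits in vram_gib."""
    epoch = 0
    while approx_dag_gib(epoch) <= vram_gib:
        epoch += 1
    return epoch


line = "2022.06.26:23:58:49.823: GPU1 GPU1: Allocating DAG for epoch #501 (4.91) GB"
gpu, epoch, reported = parse_dag_line(line)
print(gpu, epoch, reported, round(approx_dag_gib(epoch), 2))
print(first_epoch_exceeding(6.0))
```

The usable memory on a card is lower than its nominal size because the OS and the driver keep part of the VRAM for themselves, so a card can start failing before the DAG reaches the nominal limit.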

Source

CAN’T FIND NONCE WITH DEVICE CUDA_ERROR_LAUNCH_FAILED

The error means the miner cannot find a nonce, and the message itself suggests the fix: reduce the overclock. Beginner miners in particular try to squeeze the maximum out of a card and push the core or memory clocks too far. With such an overclock the card may even start, but then it throws errors like this one. Remember: a steady stream of shares to the pool beats chasing big numbers in the miner window.

PHOENIXMINER CONNECTION TO API SERVER FAILED: WHAT TO DO?

This error occurs with PhoenixMiner on HiveOS. It means the rig cannot connect to the statistics server. To fix it:

Run the net-test command and note the server with the lowest ping. Then switch to that server in the Hive web interface (on the worker) and reboot the rig.
If that doesn't help, run dnscrypt -i && sreboot

PHOENIXMINER CUDA ERROR IN CUDAPROGRAM.CU:474 : THE LAUNCH TIMED OUT AND WAS TERMINATED (702)

Like the first error, this one points to an overclock that is too aggressive. Reset the card to factory settings, then raise the overclock in small steps until the error comes back, and settle on the last step that was stable.
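
The step-by-step search for a stable overclock can be sketched as a small loop. This is only an illustration: `is_stable` is a hypothetical callback (apply the offset, mine for a while, report whether errors appeared), and the step size and limit are arbitrary values chosen for the example.

```python
from typing import Callable


def find_stable_offset(is_stable: Callable[[int], bool],
                       start: int = 0, step: int = 50, limit: int = 1500) -> int:
    """Raise the offset in small steps and return the last value that passed.

    is_stable is a hypothetical callback: apply the offset, mine for a while,
    and report whether the miner ran without errors.
    """
    last_good = start
    offset = start + step
    while offset <= limit:
        if not is_stable(offset):
            break  # first failing step: keep the previous one
        last_good = offset
        offset += step
    return last_good


# Stand-in for a real stability test: pretend the card fails above +820 MHz.
def fake_test(offset_mhz: int) -> bool:
    return offset_mhz <= 820


print(find_stable_offset(fake_test))
```

In practice each test should run long enough for errors to show up, since an unstable overclock may work for a while before it fails.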

UNABLE TO ENUM CUDA GPUS: INVALID DEVICE ORDINAL

Check the GPU drivers and the card itself (how it is listed in Device Manager, and whether it has an exclamation mark).
If all of that is fine, check the risers. A faulty riser is a frequent cause of this error.

UNABLE TO ENUM CUDA GPUS: INSUFFICIENT CUDA DRIVER: 5000

As with the previous error, check the GPU drivers and the card itself (how it is listed in Device Manager, and whether it has an exclamation mark). The message says the installed driver is too old for the CUDA version the miner expects, so updating the driver is the first thing to try.

NBMINER MINING PROGRAM UNEXPECTED EXIT.CODE: -1073740791, REASON: PROCESS CRASHED

NBMiner exit code -1073740791 appears when the rig is a mix of Nvidia and AMD cards. In that case split the mining into two .bat files (or two flight sheets if you are on HiveOS): one for the AMD cards, the other for the Nvidia cards.

NBMINER CUDA ERROR: OUT OF MEMORY (ERR_NO=2): how to fix it?

One of the most common errors on Windows is running out of memory. Here it is reported by NBMiner, but it also occurs in the NiceHash miner. The usual fix is to increase the pagefile. The pagefile should equal the total VRAM of all GPUs in the rig, in GB, plus a 10% margin.
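
The pagefile rule is simple arithmetic, sketched below. The function name is made up for this example; the 10% margin is the one stated in the guide.

```python
def recommended_pagefile_gb(vram_gb_per_card, margin=0.10):
    """Sum of VRAM across all cards plus a safety margin (10% by default)."""
    return sum(vram_gb_per_card) * (1 + margin)


# Seven 6 GB cards, like the 1660 Ti rig from the forum thread.
rig = [6] * 7
print(f"{recommended_pagefile_gb(rig):.1f} GB")
```

For that rig the rule gives about 46 GB, so a 60 GB pagefile already satisfies it. When the error persists at that size, the pagefile is probably not the cause.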

GMINER ERROR ON GPU: OUT OF MEMORY STOPPED MINING ON GPU0

In this case the culprit is most likely not the pagefile but an excessive overclock on the card listed as GPU 0. Lower the overclock and the error should go away.

Socket error. the remote host closed the connection, in NBMiner

It may also be reported as "ERROR - Failed to establish connection to mining pool: Socket operation timed out".
This is a network problem: check the rig's internet connection and restart the router.
The provider may also be blocking the connection to the pool. Switch pools, try a VPN, or change the DNS servers to an external provider such as Cloudflare (1.1.1.1, 1.0.0.1).

Server not responded on share, in GMiner

This error means something is wrong with your internet connection, which GMiner is sensitive to. Try restarting the router and disabling the watchdog in the miner.

DAG has been damaged check overclocking settings, in GMiner

The same error may also read "Device not responding, check overclocking settings".
It points to an excessive overclock, so start by lowering it.
If that doesn't help, change the miner: GMiner has never been known for working well with AMD cards. We recommend switching to TeamRedMiner, or to lolMiner if you need one miner that supports Nvidia and AMD cards at the same time.
If changing the miner doesn't help, reinstall the video driver.
If that doesn't help either, test the card on its own in the x16 slot.

ERROR: Can’t start T-Rex, failed to initialize device map: can’t get busid, code -6

Memory configuration errors with code -6 usually point to a driver problem.

If you are on Windows, use DDU (Display Driver Uninstaller) to remove all Nvidia drivers completely.
Reboot.
Install a fresh driver directly from the Nvidia website.
Reboot again.
If you are on HiveOS/RaveOS, flash a clean system image, just to be sure. 🙂

TREX: Can’t unlock GPU

Full error text:
TREX: Can’t unlock GPU [ID=1, GPU #1], error code 15
WARN: Miner is going to shutdown…
WARN: NVML: can’t get fan speed for GPU #1, error code 15
WARN: NVML: can’t get power for GPU #1, error code 15
WARN: NVML: can’t get mem/core clock for GPU #1, error code 17

Solution:

Check all cable connections on the card and the riser, especially the power cables.
If those are fine, swap the riser for a known-good one.
If the error persists, plug the card directly into an x16 slot on the motherboard.

CAN’T START MINER, FAILED TO INITIALIZE DEVICE MAP, CAN’T GET BUSID, CODE -6

In the case we saw, the problem was the power supply: it could not sustain 3 GPUs. After the PSU was replaced, the error went away.
If you are sure your PSU has enough capacity, try a different miner.

ERROR: GPU SHOWS 511 DEGREES

A reading of 511 degrees points to a faulty riser or a problem with the card's power. Check all connections. To find the fault, start the system with a single card, test it, and then add the remaining cards one at a time.

GPU driver error, no temps in HiveOS: what to do?

You most likely got this error while mining on HiveOS. It can have several causes, both software and hardware (a riser, for example).
You can try the cheap fix first and run this command in HiveOS:
hive-replace -y --stable
The system will reinstall the stable version of HiveOS from scratch.

If the error persists, check the riser.

GPU are lost, rebooting

This is not an error but the consequence of one. To find out which error makes the cards reboot, do the following:

Enable log saving (it is off by default) with the command

logs-on

Then reboot the rig.
Once the error happens again, you can download the logs with the commands below.
To download the miner logs straight from the dashboard, use:

message file "miner.log" -f=/var/log/miner/minername/minername.log

For example, to get the TeamRedMiner logs:
message file "teamredminer.log" -f=/var/log/miner/teamredminer/teamredminer.log

The command you sent is shown in blue and the uploaded file in white. Click the file to download it.
This command downloads the system log:

message file "syslog" -f=/var/log/syslog

exitcode=3 in HiveOS

If the error persists, check the riser.

exitcode=1 in HiveOS

This error occurs when the date in the motherboard BIOS is wrong (the clock has been reset) and/or there is a problem with the internet connection.
If the clock is wrong, you will not be able to connect remotely.
Even so, the Nvidia driver update should go through with the command:

nvidia-driver-update --list

gpu fault detected 146

You are most likely mining with PhoenixMiner. There are two solutions:

Roll back to an older version, for example 5.4c
(Recommended) Use T-Rex for Nvidia cards and TeamRedMiner for AMD.

Waiting interface to come up: VPN not working on HiveOS

Start with the logs to find out which error is causing the problem.
Commands for getting the logs:
systemctl status openvpn@client
journalctl -u openvpn@client -e --no-pager -n 100

How to find the IP address of a HiveOS worker

The simplest way is to open the worker and scroll down below the GPUs. The Remote IP shown there is the external IP.
Alternatively, you can check the external IP address from the Hive Shell console.
Run one of these commands:
curl 2ip.ru
wget -qO- eth0.me
wget -qO- ipinfo.io/ip
wget -qO- ipecho.net/plain
wget -qO- icanhazip.com
wget -qO- ipecho.net
wget -qO- ident.me

Repository update failed in HiveOS

Occasionally seen on HiveOS. Full error text:

Some index files failed to download. They have been ignored, or old ones used instead.
Repository update failed
------------------------------------------------------
> Restarting autofan and watchdog
> Starting miners
Miner screen is already running
Run miner or screen -r to resume screen
Upgrade failed

Solution:

Run apt update && selfupgrade -f
If that doesn't work either, it is 99.9% certain that the HiveOS developers already know about the problem and are working on it. Try the update again after a while.

RaveOS does not start. Boot aborted RaveOS

Double-check all PC and motherboard BIOS settings:
- Set the boot device to HDD/SSD/M2/USB, depending on where the OS is installed.
- Enable 4G decoding.
- Set PCIe support to Auto.
- Enable the integrated graphics.
- Set the preferred boot mode to Legacy.
- Disable virtualization.

If some cards are still not detected with these settings, change the following in the BIOS (a full reboot is required after each step):

- Disable 4G decoding
- Reboot
- Disable CSM
- Reboot
- Enable 4G decoding and set PCI-E to Gen2/3; if Gen2/3 is not available, Gen1 will do

Failed to allocate memory RaveOS

The same error may also appear as:
failed to allocate initramfs memory bailing out, failed to load ldlinux.c32
or
failed to allocate memory for kernel boot parameter block
or
failed to allocate initramfs memory raveos bailing

The fix is the same in every case: configure the motherboard BIOS correctly.

gpu_driver_fault, GPU #0 fault in RaveOS

In most cases this is solved by lowering the overclock (memory especially) on the affected card (GPU 0 in this example).
If lowering the overclock doesn't help, try updating the drivers.
If the driver update doesn't solve it, swap the riser on that card for a known-good one.
If that doesn't help either, recheck all cable connections and make sure the PSU has enough power for your configuration.

Gpu driver fault. All tasks have been stopped. Worker will be rebooted after 5 minutes, in RaveOS

What causes this error? Most likely the card is overclocked too far (memory is often pushed hard), so lower the overclock. The error output names the GPU that failed (GPU 1 in the original screenshot), so start with that card.
The second common cause is a PSU that cannot power the whole system. Keep in mind that the system itself draws at least 100 W, and budget another 50 W for each riser. The PSU should cover the total with a 20% margin.
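
The power budget can be worked out as in the sketch below. The 100 W system draw, 50 W per riser and 20% margin come from the paragraph above; the function name, the one-riser-per-card assumption and the 120 W per card figure are assumptions made for the example.

```python
def required_psu_watts(gpu_watts, system_watts=100, riser_watts=50, margin=0.20):
    """Total draw (system + every card + one riser per card) plus a margin."""
    load = system_watts + sum(gpu_watts) + riser_watts * len(gpu_watts)
    return load * (1 + margin)


# Six cards at an assumed 120 W each (a made-up figure for the example).
print(round(required_psu_watts([120] * 6)))
```

Replace the per-card figure with the power limit actually set on your cards; the result is the minimum PSU rating to look for.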

Miner restarted after error, RaveOS

Check the miner logs: they will show the specific error that triggers the restart. Then find that error on this page and fix it, and the problem will go away. 🙂

Miner restart limit reached. Worker rebooting by flag auto, in RaveOS

Same as the previous item: check the miner logs for the specific error that triggers the worker reboot. Fix that error and this problem goes away too.

Miner cannot be started, RaveOS

Another error is usually logged right before this one, and that is the real cause. If there is nothing else, then:

Pause the miner, reboot the rig, and run clear-miners, clear-logs and fix-fs in the console. Start mining again.
If the error persists, rewrite the RaveOS image.

Overclock can’t be applied in RaveOS

This error means the overclock values conflict with each other or fall outside the allowed range. Recheck them. Reset the overclock to stock and try again.
In rare cases a riser also causes this error.

Error installing hive miners

You can try the cheap fix first and run this command in HiveOS:
hive-replace -y --stable
The system will reinstall the stable version of HiveOS from scratch.

If the error persists, physically rewrite the image. If you boot from a USB flash drive, it has most likely died. Buy an SSD. 🙂

Warning: Nvidia settings applied with errors

The overclock is too high. Lower the core and memory clocks, then reboot the rig.

Nvtool error or Danger: nvtool error

Most likely something went wrong with the nvtool module while the driver was being installed.
Try reinstalling the Nvidia driver from Hive Shell:
nvidia-driver-update driver_version --force
Or try updating the whole system from Hive Shell:
hive-replace -y --stable

GPU fan no longer shown in HiveOS

Fan speed shows 0%.
This can happen for several reasons:

the fan really is not spinning
the RPM sensor is disconnected or broken
the card is being driven too hard (high overclock)
the riser or one of its parts is faulty

ERROR: parsing JSON failed

Run the following command locally on the rig (with a keyboard and monitor attached):
net-test

This command shows the current state of your connection to the various HiveOS API server mirrors.
See which API mirror has the lowest latency (ping), and once the worker reappears in the dashboard, change the default mirror to the one closest to you.
After changing the mirror, be sure to reboot the worker.
You can change the API server with the command nano /hive-config/rig.conf
After editing, press Ctrl+O and Enter to save the file.
Then exit to the console with Ctrl+X, F10 and run the hello command.

NVML: can’t get fan speed for GPU #5, error code 999 hive os

A problem with fan speed on GPU 5
Fan speed shows 0%, or errors in general
This can happen for several reasons:
- the fan really is not spinning
- the RPM sensor is disconnected or broken
- the card is being driven too hard (high overclock)
Start with a visual inspection of the card and its fan.

Can’t get power for GPU #2

This error usually shows up alongside others:
Attribute ‘GPUGraphicsClockOffset’ was already set to 0
Attribute ‘GPUMemoryTransferRateOffset’ was already set to 2200
Attribute ‘GPUFanControlState’ (hive1660s_ETH:0[gpu:2]) assigned value 0.

20211029 12:40:50 WARN: NVML: can’t get fan speed for GPU #2, error code 999
20211029 12:40:50 WARN: NVML: can’t get power for GPU #2, error code 999
20211029 12:40:50 WARN: NVML: can’t get mem/core clock for GPU #2, error code 999

Solution:

Check that the GPU driver is installed correctly.
If the driver is fine, try different overclock settings, for example a lower memory overclock.

GPU1 search error: unspecified launch failure

Lower the overclock and check the riser contacts.

Warning: Autofan: unable to set fan speed, rebooting

Find the logs and see which errors are being written there. For example:

kernel: [12112.410046][ T7358] NVRM: GPU at PCI:0000:0c:00: GPU-236e3bef-2e03-6cdb-0518-7ac01eb8736d
kernel: [12112.410049][ T7358] NVRM: Xid (PCI:0000:0c:00): 62, pid=7317, 0000(0000) 00000000 00000000
kernel: [12112.433831][ T7358] NVRM: Xid (PCI:0000:0c:00): 45, pid=7317, Ch 00000010
CRON[21094]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)

From the logs we can see a problem with the card in PCIe slot 0c:00 (the PCIe slot address is logged in place of the GPU number), with errors 45 and 62.
Error codes (including others that may appear there) and what to do about them:

• 13, 43, 45: memory errors, lower MEM
• 8, 31, 32, 61, 62: lower CORE, possibly MEM as well
• 79: lower CORE, check the riser
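
The lookup above is easy to automate when a log is long. The sketch below scans log text for NVRM Xid lines and attaches the advice from the list; the function and table names are made up for this example, and codes that are not in the list simply get no advice.

```python
import re

# Advice table copied from the list above; codes not listed map to None.
XID_ADVICE = {
    **dict.fromkeys((13, 43, 45), "memory error: lower MEM"),
    **dict.fromkeys((8, 31, 32, 61, 62), "lower CORE, possibly MEM"),
    79: "lower CORE, check the riser",
}

XID_LINE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def xid_events(log_text: str):
    """Yield (pci_address, xid_code, advice) for every Xid line in the log."""
    for m in XID_LINE.finditer(log_text):
        code = int(m.group(2))
        yield m.group(1), code, XID_ADVICE.get(code)


sample = """\
kernel: [12112.410049][ T7358] NVRM: Xid (PCI:0000:0c:00): 62, pid=7317, 0000(0000) 00000000 00000000
kernel: [12112.433831][ T7358] NVRM: Xid (PCI:0000:0c:00): 45, pid=7317, Ch 00000010
"""
for pci, code, advice in xid_events(sample):
    print(pci, code, advice)
```

Feeding it the contents of the system log gives one line per Xid event, grouped naturally by PCI address, which makes it easier to see which card to start with.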

Kernel-Power error, code 41

Check all wiring (PSU to cards, PSU to risers); a connector may be melting somewhere. If a visual inspection finds nothing wrong, the cause is likely in software, and reinstalling Windows is the next step.

Danger: hive-replace -y --stable (failed, exitcode=137)

A very rare error that popped up during a remote update of the HiveOS image. You won't find it in mining groups or on mining sites. You won't believe what happened.
A family of pigeons had moved onto the balcony where the rig stood. They crapped all over the rig, literally, and it kept dropping offline because of that. After the motherboard and cards were thoroughly blown clean, the problem went away on its own.

MALFUNCTION HIVEOS

"Malfunction" simply means a fault. There can be several causes and fixes:

Reinstall the video driver.
If the driver doesn't help, disconnect all GPUs and plug them back in one at a time, checking whether any particular card triggers the error. If one does, the riser may be to blame.
The drive holding HiveOS may be faulty; write the image again.


Source


As a data scientist or software engineer working with deep learning models, you may have encountered the dreaded “CUDA out of memory” error. This error occurs when the GPU does not have enough free memory to satisfy a new allocation. It can be frustrating to deal with, especially when you have limited time to work on a project. In this article, we will discuss what causes the CUDA out of memory error and how to fix it.

What Causes the CUDA Out of Memory Error?

The CUDA out of memory error occurs when the GPU has insufficient memory to execute a particular operation. This error can be caused by several factors, including:

1. Model Size

The size of your model can significantly impact the amount of GPU memory required to run it. If your model is too large for your GPU memory, you may encounter the CUDA out of memory error.

2. Batch Size

The batch size is another critical factor that can determine the amount of GPU memory required to run a model. Larger batch sizes require more GPU memory, and if the batch size is too large for your GPU, you may encounter the CUDA out of memory error.

3. GPU Memory Leaks

GPU memory leaks can also cause the CUDA out of memory error. Memory leaks occur when a program fails to release memory, leading to a gradual reduction in available memory. Over time, this can cause the program to run out of memory.

How to Fix the CUDA Out of Memory Error?

Now that we know what causes the CUDA out of memory error, let’s explore how to fix it.

1. Reduce the Model Size

One of the most effective ways to fix the CUDA out of memory error is to reduce the size of your model. You can do this by reducing the number of layers, parameters, or features. You can also consider starting from a smaller pre-trained model instead of training a large one from scratch.

2. Reduce the Batch Size

Another way to fix the CUDA out of memory error is to reduce the batch size, that is, the number of samples fed into the model at once. This may slow training down or change how it converges, but it directly reduces the amount of GPU memory needed for each step.
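
One common way to apply this is to retry with a smaller batch whenever an out-of-memory error is raised. The sketch below is framework-agnostic: `step` is a hypothetical callable that runs one training step, and the code assumes the framework reports the failure as a `RuntimeError` whose message contains "out of memory", as PyTorch does for CUDA OOM.

```python
def run_with_smaller_batches(step, batch_size, min_batch_size=1):
    """Call step(batch_size); on an out-of-memory error, halve and retry.

    step is a hypothetical callable that runs one training step. The check
    assumes OOM surfaces as a RuntimeError mentioning "out of memory".
    """
    while True:
        try:
            return batch_size, step(batch_size)
        except RuntimeError as err:
            if "out of memory" not in str(err) or batch_size <= min_batch_size:
                raise  # unrelated error, or nothing left to shrink
            batch_size = max(min_batch_size, batch_size // 2)


# Stand-in for a real training step: pretend batches above 24 do not fit.
def fake_step(batch_size):
    if batch_size > 24:
        raise RuntimeError("CUDA out of memory (simulated)")
    return f"ok with {batch_size}"


print(run_with_smaller_batches(fake_step, 128))
```

With a real framework you would usually also free cached memory before retrying, and keep the batch size that worked for the rest of the run rather than probing on every step.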

3. Use Mixed Precision

Mixed precision is a technique that can significantly reduce the amount of GPU memory required to run a model. This technique involves using lower-precision floating-point numbers, such as half-precision (FP16), instead of single-precision (FP32). This can reduce the memory footprint of the model without significantly impacting the model’s performance.
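
To make the saving concrete, the sketch below computes the storage needed for one dense tensor in FP32 and in FP16. It uses only the standard library (the `struct` format codes `f` and `e` are the IEEE 754 single and half precision sizes); the tensor shape is an arbitrary example, and the helper name is made up.

```python
import math
import struct

# Size in bytes of one value, taken from the struct module's IEEE 754 codes.
BYTES_PER_VALUE = {
    "fp32": struct.calcsize("<f"),  # single precision, 4 bytes
    "fp16": struct.calcsize("<e"),  # half precision, 2 bytes
}


def tensor_mib(shape, dtype):
    """Memory needed to store a dense tensor of the given shape, in MiB."""
    return math.prod(shape) * BYTES_PER_VALUE[dtype] / 2**20


# A batch of 64 RGB images at 224x224 (an arbitrary example shape).
shape = (64, 3, 224, 224)
print(tensor_mib(shape, "fp32"), tensor_mib(shape, "fp16"))
```

Each stored value takes half the space in FP16. Mixed-precision training commonly keeps an FP32 copy of the weights, so the overall saving is usually smaller than a factor of two and comes mostly from activations.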

4. Use Gradient Checkpointing

Gradient checkpointing is another technique that can help reduce the GPU memory required to run a model. This technique involves storing only a subset of the intermediate activations during the forward pass and recomputing the rest during the backward pass. This can significantly reduce the amount of GPU memory required to run the model.
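
A simplified cost model shows why this helps. The sketch below is not a training implementation: it only counts stored activations under the assumption that one checkpoint is kept per segment of layers and that a single segment is recomputed at a time during the backward pass. The function name is made up.

```python
import math


def stored_activations(n_layers: int, segment: int) -> int:
    """Activations held at peak when checkpointing every `segment` layers.

    Simplified model: one checkpoint per segment is kept for the whole pass,
    plus the activations of the single segment being recomputed.
    """
    checkpoints = math.ceil(n_layers / segment)
    return checkpoints + segment


n = 100
baseline = n  # without checkpointing, every layer's activation is stored
with_checkpoints = stored_activations(n, math.isqrt(n))
print(baseline, with_checkpoints)
```

Under this model a segment size near the square root of the layer count minimizes the stored activations, at the price of roughly one extra forward pass of computation.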

5. Fix GPU Memory Leaks

If the CUDA out of memory error is caused by GPU memory leaks, you can fix it by identifying and fixing the leaks. This can be done by using profiling tools to identify the memory leaks and modifying the code to release memory correctly.

Conclusion

The CUDA out of memory error can be frustrating to deal with, but it is not insurmountable. By understanding what causes the error and using the techniques outlined in this article, you can effectively fix the error and continue working on your deep learning projects. Remember to always keep an eye on the size of your models and batch sizes, and use techniques such as mixed precision and gradient checkpointing to reduce the amount of GPU memory required to run your models.


Source

The error message “RuntimeError: CUDA out of memory” appears when the system cannot allocate enough memory on the GPU to complete the requested operation.

In this article, we will provide you with a detailed understanding of the error message, “runtimeerror: cuda out of memory.” and how you can troubleshoot and fix it.

Why do you encounter this error?

If you encounter the “runtimeerror: cuda out of memory” error message, it means that the GPU ran out of memory while processing a particular task.

In other words, the GPU has a limited amount of memory, and if the memory needed by the project exceeds what is available, the “cuda out of memory” error message will occur.

What are the Causes of the Error?

Here are some of the most common causes:

Insufficient Memory on the GPU
Large Batch Size
Complex Model Architecture

How to Solve the Error?


Here are the solutions to help you fix the runtimeerror cuda out of memory error message:

Solution 1: Reduce the Batch Size

One of the most effective solutions for the “cuda out of memory” error message is to reduce the batch size.

A smaller batch needs less memory per step, so the GPU can handle the task without running out of memory.

Solution 2: Upgrade the GPU

If reducing the batch size doesn’t resolve the issue, upgrading to a GPU with more memory is a possible solution.

This option is only worth it if you hit the “cuda out of memory” error regularly and need more capacity for your machine learning or deep learning tasks.

Solution 3: Simplify the Model Architecture

Simplifying the model architecture can also help resolve the “runtimeerror cuda out of memory” error message.

This can be done by reducing the number of layers, decreasing the number of neurons, or using a simpler architecture.

Solution 4: Use lower precision

Mixed precision training is a technique that can reduce the memory requirements of deep learning models.

It uses lower-precision data types for certain operations, which reduces the amount of memory they need.

Solution 5: Use Gradient Checkpointing

Gradient checkpointing is another solution that can help reduce the memory requirements of deep learning models.

This method trades computation time for memory: it recomputes intermediate activations during the backward pass instead of storing all of them.

This can significantly reduce the memory requirements of deep learning models and help avoid the “runtimeerror: cuda out of memory” error message.

Solution 6: Use Data Parallelism

Data parallelism distributes the workload across multiple GPUs.

Each GPU holds a full copy of the model and processes only its share of every batch, so the activation memory needed per GPU goes down even though the model itself is not split.

This can help avoid the “runtimeerror: cuda out of memory” error message and improve the throughput of your machine learning or deep learning tasks.

Solution 7: Use Memory Optimization Methods

Several memory optimization techniques can be used to reduce the memory requirements of deep learning models.

These include weight pruning, activation pruning, and quantization. Weight pruning removes redundant weights from the model, while activation pruning removes redundant activations.

Quantization reduces the precision of the model parameters to lower the memory requirements.

FAQs

How do I know if I am running out of GPU memory?

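In PyTorch you can check GPU memory usage with torch.cuda.memory_allocated(), which reports the bytes currently held by tensors, and torch.cuda.memory_reserved(), which also includes memory cached by the allocator. If these values are close to the total memory available on your GPU, you are likely running out of GPU memory.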
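
Outside of the training process, the same check can be done with `nvidia-smi`. The sketch below queries used and total memory per GPU and parses the CSV output; the helper names are made up, and the example parses a captured sample string so that it runs on machines without an Nvidia driver.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_csv(text: str):
    """Turn 'used, total' lines (MiB) into a list of (used, total, fraction)."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total, used / total))
    return rows


def gpu_memory_usage():
    """Run nvidia-smi and parse its output; requires an Nvidia driver install."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_csv(out)


# Parsing a captured sample instead of calling nvidia-smi, so this runs anywhere.
sample = "5632, 6144\n1024, 8192\n"
for used, total, frac in parse_memory_csv(sample):
    print(f"{used}/{total} MiB ({frac:.0%})")
```

This view also counts memory used by other processes on the same GPU, which the PyTorch counters do not see.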

Can I fix the Runtimeerror: cuda out of memory. error by adding more RAM to my computer?

No, adding more RAM to your computer will not fix the Runtimeerror: cuda out of memory. error. This error is related to the memory on your GPU, not your computer’s RAM.

Can I use data parallelism if I only have one GPU?

No, data parallelism requires multiple GPUs to be effective. If you only have one GPU, you may need to try one of the other solutions to fix the Runtimeerror: cuda out of memory. error.


Conclusion

In conclusion, we discussed what causes the error and why it occurs, and provided solutions you can apply to fix it.

These include reducing the batch size, upgrading the GPU, simplifying the model architecture, mixed precision training, gradient checkpointing, data parallelism, and memory optimization techniques.

Remember to choose the solution that best fits your needs and the specific requirements of your project.

Source

Hi,

Thanks for sharing this great work. I’m trying to run the samples on a smaller GPU: GTX 1060 6 GB.
The "Einstein" example runs fine, but when I run the fox example I get:

13:49:05 SUCCESS  Loaded 50 images of size 1080x1920 after 1s
13:49:05 INFO       cam_aabb=[min=[1.0229,-1.33309,-0.378748], max=[2.46175,1.00721,1.41295]]
13:49:05 INFO     Loading network config from: configs\nerf\base.json
13:49:05 INFO     GridEncoding:  Nmin=16 b=1.51572 F=2 T=2^19 L=16
Warning: FullyFusedMLP is not supported for the selected architecture 61. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 61. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
13:49:05 INFO     Density model: 3--[HashGrid]-->32--[FullyFusedMLP(neurons=64,layers=3)]-->1
13:49:05 INFO     Color model:   3--[SphericalHarmonics]-->16+16--[FullyFusedMLP(neurons=64,layers=4)]-->3
13:49:05 INFO       total_encoding_params=13074912 total_network_params=9728
13:49:06 ERROR    Uncaught exception: ***\dependencies\tiny-cuda-nn\include\tiny-cuda-nn/gpu_memory.h:531 cuMemSetAccess(m_base_address + m_size, n_bytes_to_allocate, &access_desc, 1) failed with error CUDA_ERROR_OUT_OF_MEMORY
Could not free memory: ***\dependencies\tiny-cuda-nn\include\tiny-cuda-nn/gpu_memory.h:452 cuMemAddressFree(m_base_address, m_max_size) failed with error CUDA_ERROR_INVALID_VALUE

Is it still possible to run this example with some modified parameters for gpu’s with lower memory, or should I give up?

Small note:
atomicAdd(__half2) is also not supported on my architecture (=61). I needed to disable it in "common_device.cuh".


Hi there,

you might be able to further squeeze down the memory usage by reducing the resolution --width 1280 --height 720, but I’m unsure this will be enough.

Regarding atomicAdd(__half2): I’m surprised actually. How does this error manifest? I’d like to make this codebase work on as wide a range of GPUs as possible and both the CUDA documentation and CI suggest it should work on compute capability 61.


When building I get:

D:***\include\neural-graphics-primitives/common_device.cuh(127): error : no instance of overloaded function "atomicAdd" matches the argument list [D:\***\build\ngp.vcxproj]
              argument types are: (__half2 *, {...})
            detected during instantiation of "void ngp::deposit_image_gradient(const Eigen::Matrix<float, N_DIMS, 1, <expression>, N_DIMS, 1> &, T *, T *, const Eigen::Vector2i &, const
   Eigen::Vector2f &) [with N_DIMS=2U, T=float]"
  D:\***\src\testbed_nerf.cu(1512): here

D:\***\include\neural-graphics-primitives/common_device.cuh(128): error : no instance of overloaded function "atomicAdd" matches the argument list [D:***\build\ngp.vcxproj]
              argument types are: (__half2 *, {...})
            detected during instantiation of "void ngp::deposit_image_gradient(const Eigen::Matrix<float, N_DIMS, 1, <expression>, N_DIMS, 1> &, T *, T *, const Eigen::Vector2i &, const
   Eigen::Vector2f &) [with N_DIMS=2U, T=float]"
  D:\***\src\testbed_nerf.cu(1512): here

Maybe it’s a wrong dependency problem, instead of the hardware not supporting it:
Win10
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0
cmake version 3.23.0-rc1
Python 3.8.10

Running the following command still gives me the same error:
./build/testbed.exe --scene data/nerf/fox --width 10 --height 10

But reducing the number of photos to 20 makes it possible to run.


Thanks for elaborating on the atomicAdd(__half2) issue. Could you pass the --verbose argument to the cmake build command and paste the nvcc CLI? I’m curious whether this issue is happening on your end (i.e. CMake is passing a too low compute architecture to your compiler).

It’s indeed the same issue. I’m seeing -gencode=arch=compute_52,code=\"sm_52,compute_52\" in the nvcc command line, while the build still detects -DTCNN_MIN_GPU_ARCH=61. Adding the extra flags resolves it.
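The mismatch described here (code compiled for compute_52 while TCNN_MIN_GPU_ARCH says 61) can be spotted mechanically in the verbose build output. The following is only a rough sketch: the function names are hypothetical, the sample command line simply joins the two fragments quoted above, and it assumes the flags appear in exactly that textual form. As far as I know, the CUDA programming guide lists atomicAdd on __half2 as available from compute capability 6.x upward, which would explain why a compute_52 compile rejects it.

```python
import re
from typing import List, Optional


def compile_archs(nvcc_cmd: str) -> List[int]:
    """Return every architecture named as -gencode=arch=compute_XX."""
    return [int(a) for a in re.findall(r"-gencode=arch=compute_(\d+)", nvcc_cmd)]


def min_arch_define(nvcc_cmd: str) -> Optional[int]:
    """Return the value of -DTCNN_MIN_GPU_ARCH=XX, or None if it is absent."""
    match = re.search(r"-DTCNN_MIN_GPU_ARCH=(\d+)", nvcc_cmd)
    return int(match.group(1)) if match else None


def find_arch_mismatch(nvcc_cmd: str) -> List[int]:
    """List compile architectures that are lower than TCNN_MIN_GPU_ARCH."""
    expected = min_arch_define(nvcc_cmd)
    if expected is None:
        return []
    return [a for a in compile_archs(nvcc_cmd) if a < expected]


# Hypothetical command line assembled from the two fragments quoted in the post.
example_cmd = (
    'nvcc -DTCNN_MIN_GPU_ARCH=61 '
    '-gencode=arch=compute_52,code="sm_52,compute_52"'
)
print(find_arch_mismatch(example_cmd))  # [52]
```

An empty list would mean that every -gencode architecture is at least as high as the detected minimum.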

Can anyone elaborate on this? I couldn’t get what has been done to fix the issue. Would really appreciate it.

#204 (comment)

I just got the same error, running on a GTX 1080.
The fox does work for me, but when I try to run a dataset I prepared myself it gives this error. That was a very large dataset though, so I tried shrinking it down and it still gives the same error (it is smaller than the fox in MB at this point).

edit: I forgot to add that adding --width 10 --height 10 also does nothing for me.


The 1050 Ti runs out of memory as well.

I’m also getting the same error regarding atomicAdd (Running on a GTX 1080 TI)


What are the VRAM requirements for the provided examples after all?


The VRAM requirements vary with architecture, older GPUs unfortunately requiring more RAM due to needing fp32 for efficiency and not being able to run fully fused neural networks.

In general, it seems that 8 GB are enough to run fox in all cases — so only a little push would be needed to make it fit into OP’s 6 GB card.


Thanks for clarification.

In general, it seems that 8 GB are enough to run fox in all cases — so only a little push would be needed to make it fit into OP’s 6 GB card.

As I’ve said, I prepared my own dataset and I get this error no matter how far I optimize it, but the fox works every time. Unless I’m misunderstanding how the VRAM requirements vary per dataset, it should be working, shouldn’t it?
Fox has 50 jpg images, totaling 17.5 MB.
My own set I have now shrunk down to 33 jpg images totaling 2 MB, but it still says there’s not enough memory.

There are multiple aspects to this:

The size of the actual jpg files does not directly matter. What does matter is the resolution of the images, because instant-ngp loads the images into memory in uncompressed form. To figure out how much memory your images need, calculate n_bytes = n_images * width * height * 4 * 2. The number 4 is the number of color channels internal to instant-ngp, and the number 2 refers to the fact that 2 bytes (fp16) are used to represent each channel.
The aabb_scale setting of the dataset incurs additional memory overhead. The higher the aabb_scale, the more memory is used for cascaded occupancy grids. The fox dataset uses "aabb_scale": 4. See the NeRF dataset tips for additional details about this parameter.
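As a quick illustration of the first point, the formula can be evaluated for the fox dataset using the numbers from the log at the top of the thread (50 images of 1080x1920). This is a minimal sketch; the function name image_bytes is made up for this example, and the result covers only the uncompressed images, not the aabb_scale overhead or anything else the program allocates.

```python
def image_bytes(n_images, width, height, channels=4, bytes_per_channel=2):
    """n_bytes = n_images * width * height * 4 * 2, as given in the post above."""
    return n_images * width * height * channels * bytes_per_channel


# Fox dataset as reported in the log: 50 images of size 1080x1920.
fox = image_bytes(50, 1080, 1920)
print(fox)                      # 829440000
print(round(fox / 2**20, 1))    # 791.0 (MiB)
```

So the 17.5 MB of jpg files occupy roughly 791 MiB once loaded, which is why the file size on disk says little about whether a dataset fits.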

Cheers!

Aha! That was absolutely it! Changing the aabb_scale from 16 to 4 and rerunning the preparation makes my set fit inside the card 😁

Amazing work. Thank you! I am a university student in my first year and amazed at what you guys are doing. May I please one day create these kinds of cool things too.😆

Thank you for explaining that aspect @Tom94 . It worked on my 1060!

Same problem here

roman@DESKTOP-5B9D0K1 MINGW64 ~/Documents/instant-ngp (master)
$ ./build/testbed.exe --scene data/nerf/fox --width 10 --height 10
18:05:03 INFO     Loading NeRF dataset from
18:05:03 INFO       data\nerf\fox\transforms.json
18:05:03 SUCCESS  Loaded 12 images of size 1080x1920 after 0s
18:05:03 INFO       cam_aabb=[min=[1.47019,-1.33309,0.171354], max=[2.38974,-0.344117,0.314917]]
18:05:03 INFO     Loading network config from: configs\nerf\base.json
18:05:03 INFO     GridEncoding:  Nmin=16 b=1.51572 F=2 T=2^19 L=16
Warning: FullyFusedMLP is not supported for the selected architecture 52. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 52. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 52. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
18:05:03 INFO     Density model: 3--[HashGrid]-->32--[FullyFusedMLP(neurons=64,layers=3)]-->1
18:05:03 INFO     Color model:   3--[Composite]-->16+16--[FullyFusedMLP(neurons=64,layers=4)]-->3
18:05:03 INFO       total_encoding_params=13074912 total_network_params=9728
18:05:05 ERROR    Uncaught exception: C:\Users\roman\Documents\instant-ngp\dependencies\tiny-cuda-nn\include\tiny-cuda-nn/gpu_memory.h:558 cuMemSetAccess(m_base_address + m_size, n_bytes_to_allocate, &access_desc, 1) failed with error CUDA_ERROR_OUT_OF_MEMORY
Could not free memory: C:\Users\roman\Documents\instant-ngp\dependencies\tiny-cuda-nn\include\tiny-cuda-nn/gpu_memory.h:462 cuMemAddressFree(m_base_address, m_max_size) failed with error CUDA_ERROR_INVALID_VALUE

$ nvidia-smi
Sun Apr  3 18:06:21 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 511.65       Driver Version: 511.65       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro M2200       WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   37C    P8    N/A /  N/A |      0MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+


Running a 3090, I found the limit for different resolutions and developed a shortcut for estimating how many images fit. Take the total pixel count of one image and multiply by 11.77. Divide your VRAM by that number and you get a good estimate of how many photos you can handle. Your card has 4 GB of VRAM, I believe, and you have to subtract what your system uses to operate. At 1920 x 1080 you only have room for 13 photos at 4 GB. There are two things you can do: decrease the size of the photos until the process runs, or run your monitor from the video output of your motherboard if it has one. Hopefully you can free up enough VRAM to make the fox run at its current resolution.

Same error here CUDA_ERROR_OUT_OF_MEMORY, Could not free memory: CUDA_ERROR_INVALID_VALUE.

I tried to remove all but two images to create a minimal setup, still fails.


Hey, thank you for your great work!
My GPU is a 3060 (12 GB) and I am trying to reconstruct self-collected scenes. All the images are 2976x3968 and were converted with colmap2nerf.py. Each scene has about 100 images, and the number of images in the json file varies from 20 to 50. Some scenes can be reconstructed and others cannot.
14:36:06 INFO Loading NeRF dataset from
14:36:06 INFO /media/hyx/code/instant-ngp/data/xgrids/goact/transforms.json
14:36:07 SUCCESS Loaded 97 images of size 2976x3968 after 0s
14:36:07 INFO cam_aabb=[min=[0.394222,-0.798393,0.153654], max=[1.87405,1.80321,1.1615]]
14:36:07 ERROR Uncaught exception: Could not allocate memory: /media/hyx/code/instant-ngp/dependencies/tiny-cuda-nn/include/tiny-cuda-nn/gpu_memory.h:112 cudaMalloc(&rawptr, n_bytes+DEBUG_GUARD_SIZE*2) failed with error out of memory
I tried changing the aabb_scale, adding parameters such as --width 10 --height 10, and specifying the json file directly, but none of it worked. How can I compute how many images I can reconstruct? Why did I succeed on some scenes that have more images but fail on scenes with fewer images?
Thank you for your time.
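One way to approach the "how many images" question is to invert the image-memory formula quoted earlier in the thread (n_images * width * height * 4 * 2 bytes). The sketch below is only an estimate under stated assumptions: the helper names are hypothetical, the 6 GiB budget is an arbitrary placeholder for whatever VRAM is actually left for images on a given card, and the formula ignores aabb_scale overhead and everything else the program allocates, so it does not by itself explain why one scene loads and another does not.

```python
import math

# 4 channels x 2 bytes (fp16) per pixel, per the formula quoted earlier in the thread.
BYTES_PER_PIXEL = 4 * 2


def max_images(budget_bytes, width, height):
    """Largest image count whose uncompressed pixels fit in budget_bytes."""
    return budget_bytes // (width * height * BYTES_PER_PIXEL)


def downscale_factor(budget_bytes, n_images, width, height):
    """Uniform scale (<= 1.0) for both image sides so that n_images fit."""
    needed = n_images * width * height * BYTES_PER_PIXEL
    return min(1.0, math.sqrt(budget_bytes / needed))


# Values from the log above: 97 images of size 2976x3968.
needed = 97 * 2976 * 3968 * BYTES_PER_PIXEL
print(needed)  # 9163603968 bytes, about 8.5 GiB for the images alone

budget = 6 * 2**30  # assumed budget, not a measured value
print(max_images(budget, 2976, 3968))
print(downscale_factor(budget, 97, 2976, 3968))
```

Under these assumptions the 97 full-resolution images already take most of a 12 GB card before anything else is allocated, so loading fewer images or downscaling them is the obvious lever.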


I’m getting the same error when I try to run "G:\NERF\instant-ngp>build\testbed.exe --scene data/toy_truck --width 10 --height 10":

"failed with error CUDA_ERROR_OUT_OF_MEMORY, could not free memory ...../gpu_memory.h:465 cuMemAddressFree(m_base_address, m_max_size) failed with error CUDA_ERROR_INVALID_VALUE"
I’m trying to run it on an NVIDIA GeForce GTX 980 with 16 GB of RAM.
I tried hard-coding
list(APPEND CUDA_NVCC_FLAGS "-gencode=arch=compute_52,code=\"sm_52,compute_52\"") in my CMakeLists.txt files, but still no luck. I followed the instructions and everything compiled and ran fine, but I still get this error. I’d appreciate it if someone could help me with this.


Let me add my GPU too, with the same issue: GTX 970. It is more or less on the supported list, and it shows architecture 52.


GeForce GT 1030 here with 2 GB of VRAM (architecture 61). I reduced the fox dataset to 3 images and it still fails. Is it absolutely necessary to load all images into VRAM at once in uncompressed form? Loading them in chunks might at least fix the issue…

The same error occurred on a GTX 1650 with 4 GB. I reduced the number of images to two, but it still reported this error.

17:07:49 SUCCESS Initialized OpenGL version 4.6.0 NVIDIA 516.94
17:07:57 INFO Loading NeRF dataset from
17:07:57 INFO D:\application\Instant-NGP-for-GTX-1000\Instant-NGP-for-GTX-1000\data\s2\transforms.json
17:07:57 SUCCESS Loaded 2 images after 0s
17:07:57 INFO cam_aabb=[min=[1.70659,0.570126,0.484637], max=[1.87223,0.88163,0.532478]]
17:07:57 INFO Loading network config from: .\configs\nerf\base.json
17:07:57 INFO GridEncoding: Nmin=16 b=3.28134 F=4 T=2^19 L=8
Warning: FullyFusedMLP is not supported for the selected architecture 61. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 61. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 61. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
17:07:57 INFO Density model: 3--[HashGrid]-->32--[FullyFusedMLP(neurons=64,layers=3)]-->1
17:07:57 INFO Color model: 3--[Composite]-->16+16--[FullyFusedMLP(neurons=64,layers=4)]-->3
17:07:57 INFO total_encoding_params=13194816 total_network_params=10240
Attempting to free memory arena while it is still in use.
Could not free memory arena: D:/a/instant-ngp/instant-ngp/dependencies/tiny-cuda-nn/include\tiny-cuda-nn/gpu_memory.h:489 cuMemAddressFree(m_base_address, m_max_size) failed with error CUDA_ERROR_INVALID_VALUE
17:07:58 ERROR Uncaught exception: D:/a/instant-ngp/instant-ngp/dependencies/tiny-cuda-nn/include\tiny-cuda-nn/gpu_memory.h:592 cuMemSetAccess(m_base_address + m_size, n_bytes_to_allocate, &access_desc, 1) failed with error CUDA_ERROR_OUT_OF_MEMORY
