Proxmox: Checking a Disk for Errors

Although a robust and redundant storage is recommended,
it can be very helpful to monitor the health of your local disks.

Starting with Proxmox VE 4.3, the package smartmontools
(homepage: https://www.smartmontools.org) is installed and required.
This is a set of tools to monitor and control
the S.M.A.R.T. system for local hard disks.

You can get the status of a disk by issuing the following command:

# smartctl -a /dev/sdX

where /dev/sdX is the path to one of your local disks.

If the output says:

SMART support is: Disabled

you can enable it with the command:

# smartctl -s on /dev/sdX

For more information on how to use smartctl, please see man smartctl.
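
As a rough sketch of the check described above, the helper below reads `smartctl -i` output on standard input and reports whether SMART is switched on. The function name `smart_support_state` is invented for this example, and the matched text assumes the "SMART support is: Enabled/Disabled" wording quoted above.

```shell
#!/bin/sh
# Hypothetical helper: classify the "SMART support is:" lines of `smartctl -i`.
# Reads smartctl output on stdin; prints enabled, disabled or unknown.
smart_support_state() {
    awk '
        /SMART support is:[ \t]*Enabled/  { state = "enabled" }
        /SMART support is:[ \t]*Disabled/ { state = "disabled" }
        END { print (state == "" ? "unknown" : state) }
    '
}
```

It could be used as `smartctl -i /dev/sdX | smart_support_state`; if that prints `disabled`, `smartctl -s on /dev/sdX` is the command to run next.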

By default, smartmontools daemon smartd is active and enabled, and scans
the disks under /dev/sdX and /dev/hdX every 30 minutes for errors and warnings, and sends an
e-mail to root if it detects a problem.

For more information about how to configure smartd, please see man smartd and
man smartd.conf.

If you use your hard disks with a hardware RAID controller, there are most likely tools
to monitor the disks in the RAID array and the array itself. For more information about this,
please refer to the vendor of your RAID controller.

The need to check a disk for bad sectors tends to come up unexpectedly, so it is better to know how to do it and to have everything you need at hand. There are many ways to check a disk. Here I cover checking from the Linux console: simple, with nothing extra.

Contents:

  • 1 Reasons to check a disk
  • 2 Identifying the disk to check
  • 3 Checking the disk for bad sectors
    • 3.1 Creating a file to record bad sectors
    • 3.2 Checking the disk with badblocks
    • 3.3 Marking the disk's bad sectors
  • 4 Preparing a disk for checking
  • 5 Conclusion

Reasons to check a disk

The usual reason for a check is a system that runs slowly or hangs during certain operations. Various things can damage a disk. Here are some of them:

  • A disk's service life is finite;
  • Unclean shutdowns when power is lost;
  • Physical shocks;
  • Starting up a cold disk in winter.

The best approach is to check the disk periodically, even when nothing seems wrong. The earlier a problem is found, the better the chances of saving important data.

Keep important data in two physically separate places. Only that approach gives you real protection against data loss.

Identifying the disk to check

To find out which disk to check, run a console command that lists all the disks in the system.

fdisk -l
= partial command output =
Disk /dev/sda: 232.9 GiB, 250059350016 bytes, 488397168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x42ef42ef

Device     Boot     Start       End   Sectors  Size Id Type
/dev/sda1  *         2048 184322047 184320000 87.9G  7 HPFS/NTFS/exFAT
/dev/sda2       184322048 488394751 304072704  145G  7 HPFS/NTFS/exFAT

The output shows the disk we need to check. It has 2 data partitions.

Checking the disk for bad sectors

Partitions must be unmounted before the check. I usually boot a Linux-based system from a Live image, or use a prepared PXE server that also carries other hard disk diagnostic tools.
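
A small sketch of how one might confirm that a partition is not mounted before starting. The function name `is_mounted` is made up for this example; it assumes the usual /proc/mounts layout, where the first field of each line is the mounted device.

```shell
#!/bin/sh
# Hypothetical helper: succeed if DEVICE appears as a mounted source in a
# mounts table (default /proc/mounts; a second argument overrides it).
is_mounted() {
    device=$1
    table=${2:-/proc/mounts}
    # The first field of each line is the mounted device; compare it exactly.
    awk -v dev="$device" '$1 == dev { found = 1 } END { exit !found }' "$table"
}
```

For example, `is_mounted /dev/sda1 && umount /dev/sda1` unmounts the partition only if it is currently mounted.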

You could run a check that repairs as it goes, but that seems wrong to me. It makes more sense to check the disk first, collect information about all the bad sectors, and only then decide what to do with the disk.

Once even a few bad sectors appear, I stop using the disk. I mark bad sectors and try to recover what is in them only in order to copy the data to another disk.

Creating a file to record bad sectors

Create a file, naming it after the partition being checked for convenience.

touch "/root/bad-sda1.list"

Checking the disk with badblocks

Run the check with a progress indicator and verbose output. The larger the disk, the longer the check takes!

badblocks -sv /dev/sda1 > /root/bad-sda1.list
= progress information =
Checking blocks 0 to 976761542
Checking for bad blocks (read-only test): 0.91% done, 1:43 elapsed. (0/0/0 errors)
= detailed result =
Checking blocks 0 to 156289862
Checking for bad blocks (read-only test): done
Pass completed, 8 bad blocks found. (8/0/0 errors)

In our case the disk has 8 bad blocks.
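
To look at the result file without opening it, something like the following sketch could be used. The function name `summarize_badblocks` is hypothetical, and it assumes the list holds one block number per line, which is how badblocks writes it.

```shell
#!/bin/sh
# Hypothetical helper: summarize a badblocks output file (one block number
# per line). Prints the count and, if any, the lowest and highest block.
summarize_badblocks() {
    awk '
        /^[0-9]+$/ {
            n++
            if (n == 1 || $1 < min) min = $1
            if (n == 1 || $1 > max) max = $1
        }
        END {
            if (n) printf "%d bad blocks (first %d, last %d)\n", n, min, max
            else   print "0 bad blocks"
        }
    ' "$1"
}
```

Called as `summarize_badblocks /root/bad-sda1.list`, it prints a one-line summary of the list.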

Marking the disk's bad sectors

Run e2fsck and give it the list of bad blocks. The program marks the bad blocks and tries to recover the data.

There is no need to say which ext variant is in use: e2fsck handles ext2, ext3 and ext4 by itself. It does not work on other file systems such as NTFS.

e2fsck -l /root/bad-sda1.list /dev/sda1
= command output =
e2fsck 1.43.3 (04-Sep-2016)
Bad block 44661688 out of range; ignored.
Bad block 44661689 out of range; ignored.
Bad block 44661690 out of range; ignored.
Bad block 44911919 out of range; ignored.
Bad block 44958212 out of range; ignored.
Bad block 44958213 out of range; ignored.
Bad block 44958214 out of range; ignored.
Bad block 44958215 out of range; ignored.
/dev/sda1: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sda1: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda1: 11/9773056 files (0.0% non-contiguous), 891013/39072465 blocks
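
The "out of range; ignored" lines above suggest that the block numbers in the list did not fit the file system. The e2fsck man page notes that the list given to -l is read in units of the file system's block size, so badblocks has to be run with a matching -b value (or e2fsck -c can run badblocks itself). Below is a sketch of how the block size might be obtained. The function name `fs_block_size` is invented here, and it assumes `tune2fs -l` prints a "Block size:" line.

```shell
#!/bin/sh
# Hypothetical helper: extract the block size from `tune2fs -l` output on stdin.
fs_block_size() {
    awk -F: '/^Block size:/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Intended use (not run here; needs root and an unmounted ext2/3/4 partition):
#   bs=$(tune2fs -l /dev/sda1 | fs_block_size)
#   badblocks -sv -b "$bs" -o /root/bad-sda1.list /dev/sda1
#   e2fsck -l /root/bad-sda1.list /dev/sda1
```

With the block sizes aligned, the numbers in the list refer to the same blocks that e2fsck counts.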

Preparing a disk for checking

Sometimes the partition table on a disk is damaged and there is no way to see which data partitions exist. Or perhaps you need none of the data on the disk and want to format it and then check it. Cases differ, and the approach depends on the situation.

The -z option makes cfdisk start with an empty partition table, so you can create the table anew along with the partitions you need.

cfdisk -z /dev/sda

I will not explain how to use cfdisk, as that is beyond the scope of this article.

Suppose you created a single partition, /dev/sda1, spanning the whole disk. To format it as ext4, run:

mkfs.ext4 /dev/sda1
= command output =
mke2fs 1.43.3 (04-Sep-2016)
Creating filesystem with 244190385 4k blocks and 61054976 inodes
Filesystem UUID: c4a1eeed-960a-4aea-a5ff-02ce93bf0a2e
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848

Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done

Conclusion

I have not come across a simpler or clearer way to check a disk for bad sectors than the one Linux offers. Nothing extra, just the essentials. How and when to check is always up to you. Ever since I lost important data once, I keep everything important in 3 different places.

The aim of this article is to configure SMART monitoring in Proxmox and to send emails if anything untoward is found. You should already have the email system set up as per this earlier article. This article assumes that your disks are attached directly to the system rather than via a RAID controller. There are probably other ways to monitor disks that are connected via a RAID controller.

Checking Things Out

Fortunately for us all, most Proxmox installs should come with all of the monitoring tools we need out of the box. As the user guide states, since Proxmox version 4.3 (Sept 2016) it has shipped with smartmontools, a comprehensive SMART monitoring utility. As mentioned in this earlier article, inspecting disks with smartmontools manually is easy, but who wants to do a job like that manually? An understanding of what the smartmontools utilities can do will help with this task, so it’s well worth having a read of the manuals linked at the bottom of this article.

By default the smartmontools daemon, smartd, is running and polls the disks automatically every 30 minutes. The poll time is configurable, as noted in the manual, but I see no good reason to change it. If you do want to change it, the setting is in /etc/default/smartmontools: just uncomment and edit the smartd_opts line. These options will be passed through to the init script that starts the daemon.

To check that smartd is running you can list and filter active processes like this:

# ps aux | grep smart
root      251302  0.0  0.0  11996  6564 ?        Ss   May19   0:00 /usr/sbin/smartd -n
root      770262  0.0  0.0   6244   648 pts/0    S+   10:32   0:00 grep smart

The first line shows that smartd is running. The other columns show that the process is owned by root, has a PID of 251302, is using no CPU or memory (well, very little) and various other things. The Ss means that the process is in an interruptible sleep and that it’s the session leader; in other words it was idle, waiting for its next check, when I looked at it.

Better than that though you can ask systemctl about the status of the service. Typical output would look something like this:

# systemctl status smartd

● smartmontools.service - Self Monitoring and Reporting Technology (SMART) Daemon
     Loaded: loaded (/lib/systemd/system/smartmontools.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2023-05-19 17:54:57 BST; 4 days ago
       Docs: man:smartd(8)
             man:smartd.conf(5)
   Main PID: 251302 (smartd)
     Status: "Next check of 6 devices will start at 18:24:57"
      Tasks: 1 (limit: 33474)
     Memory: 2.2M
        CPU: 587ms
     CGroup: /system.slice/smartmontools.service
             └─251302 /usr/sbin/smartd -n

May 23 13:55:24 xxx smartd[251302]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 67
May 23 13:55:24 xxx smartd[251302]: Device: /dev/sde [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 32 to 33
May 23 14:25:02 xxx smartd[251302]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 66 to 65
May 23 14:25:02 xxx smartd[251302]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 34 to 35
May 23 14:25:13 xxx smartd[251302]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 64
May 23 14:25:13 xxx smartd[251302]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 36
May 23 14:25:18 xxx smartd[251302]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
May 23 14:25:18 xxx smartd[251302]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 33 to 32
May 23 17:55:08 xxx smartd[251302]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 63
May 23 17:55:08 xxx smartd[251302]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 38 to 37

As you can see we get some information about when the service was started, when it’ll next check the drives and the last few log messages it created. Notice that the log messages are all about temperature and not very interesting; we’ll fix that when we create a more complete configuration.
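
Until the configuration is changed, the temperature lines can be filtered out when reading the log. Here is a minimal sketch; the function name `drop_temperature_noise` is made up, and the pattern assumes the "SMART Usage Attribute: 190/194" wording seen in the output above.

```shell
#!/bin/sh
# Hypothetical helper: drop smartd log lines that only report temperature
# attributes 190 and 194, leaving everything else untouched.
drop_temperature_noise() {
    # grep exits non-zero when nothing is left; treat that as success too.
    grep -Ev 'SMART Usage Attribute: (190|194) ' || true
}
```

One way to use it would be `journalctl -u smartmontools --no-pager | drop_temperature_noise`, taking the unit name from the status output above.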

Out of the Box Configuration of smartd

The configuration for smartd can be found in the /etc/smartd.conf file. The default settings, once all the comments have been removed, are extremely simple:

DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner

The DEVICESCAN setting tells smartd to scan for drives to monitor rather than expect a list of drives. The comments in the conf file mention that normally systems should list all the drives to monitor manually but from what I can find online most people seem to scan. This setting can take a number of directives, the current settings are described below.

  • -d removable –> ignore drives that are missing at start up. To be honest I’ve not found a really good explanation of this directive and whether it’s safe to use with drives that wouldn’t usually be removed. It seems it’s more aimed at the situation where you enumerate drives manually. In that situation you wouldn’t want smartd complaining or exiting because you unplugged a USB drive. In a mixed system with some removable and some non-removable drives it’s not clear to me what this directive does.
  • -n standby –> don’t spin up a disk to check it, let it stay in sleep or standby. In other words, let sleeping disks lie.
  • -m root –> if a problem is detected send an email to root, this can also be one or more full email addresses
  • -M exec /usr/share/smartmontools/smartd-runner –> rather than run the mail command run the specified script.

The smartd-runner script executes all the scripts that are found in the /etc/smartmontools/run.d directory. With a default install this is a single script that just calls the mail utility.
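
As an illustration of what an additional script in that directory might look like, here is a sketch that writes a one-line summary to syslog. It assumes that smartd exports SMARTD_DEVICE, SMARTD_FAILTYPE and SMARTD_MESSAGE to programs started through -M exec, as the smartd.conf man page describes, and that smartd-runner leaves them in the environment. The function name and the syslog tag are invented for the example.

```shell
#!/bin/sh
# Sketch of an extra script for /etc/smartmontools/run.d (name it as you like).
# Assumption: SMARTD_DEVICE, SMARTD_FAILTYPE and SMARTD_MESSAGE are set by
# smartd for "-M exec" programs and are still visible here.
format_smartd_alert() {
    printf 'smartd alert: device=%s type=%s message=%s\n' \
        "${SMARTD_DEVICE:-unknown}" \
        "${SMARTD_FAILTYPE:-unknown}" \
        "${SMARTD_MESSAGE:-none}"
}

# Only log when smartd actually provided a device, so a manual run is harmless.
if [ -n "${SMARTD_DEVICE:-}" ] && command -v logger >/dev/null 2>&1; then
    format_smartd_alert | logger -t smartd-alert
fi
```

Scripts in run.d need to be executable to be picked up; the -M test directive is a convenient way to trigger one without waiting for a real fault.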

Configuring smartd Better

The default configuration isn’t terrible but it also leaves a bit to be desired. After much documentation reading and searching I came across this post that provides a better starting point (also mentioned here linking to this). The configuration I’ll be using, which is based on that suggestion, is shown below.

DEVICESCAN -H -f -u -p -l error -l selftest -n standby,24,q \
-I 194 \
-I 190 \
-W 5,45,50 \
-i 9 \
-R 5! \
-C 197 \
-U 198 \
-o on -S on -s (S/../.././02|L/../01/./04) \
-m root \
-M test \
-M exec /usr/share/smartmontools/smartd-runner

A quick rundown of the directives is given below, for full details see the man page of smartd.conf.

  • -H: (ATA only) Check the SMART health status and log a critical error if any pre-failure attribute has reached or passed its threshold value. If this check has been triggered a failure is imminent.
  • -f: (ATA only) Check for usage attribute failures, e.g. the drive is too warm or is past its design life. These don’t indicate that a failure is imminent but are warning signs that one might be around the corner.
  • -u: (ATA only) Report changes in usage attributes.
  • -p: (ATA only) Report changes in pre-failure attributes.
  • -l: (lower case L) Look for increases in the SMART logs. The error argument examines the summary error log and the selftest argument checks the self-test log. The latter option only makes sense if you are running regular self-tests with the -s option.
  • -n: The standby argument tells smartd not to wake the disks if they aren’t spinning. The value 24 means that a disk is checked anyway once 24 checks in a row have been skipped. The q suppresses the log messages for skipped checks.
  • -I: (upper case i) Ignore an attribute when tracking changes: 190 is airflow temperature and 194 is temperature. This cuts down the number of log messages for values that change very frequently and have little diagnostic worth.
  • -W: Track temperature changes. The first value reports a change of at least that many degrees since the last report; the next two values specify the info and critical report thresholds.
  • -i: Ignore an attribute when checking for usage attribute failures (-f). In this case attribute 9 is specified, which is power-on hours. This prevents emails being sent for old disks that otherwise seem to be working fine.
  • -R: Report all changes in the raw value of the given attribute. In this case 5 is specified, which is the reallocated sector count, a key indication that a drive is failing. The exclamation mark logs the change as critical.
  • -C: Report if the current pending sector count is non-zero. The number specifies the attribute to check; it’s usually 197, but some vendors have used other numbers in the past. This is a key indicator of a disk that may fail.
  • -U: Report if the offline uncorrectable count is non-zero. The number specifies the attribute to check; it’s usually 198, but some vendors have used other numbers in the past. This is a key indicator of a disk that may fail.
  • -o: Turns on automatic offline SMART testing when smartd starts.
  • -S: Turns on attribute autosave when smartd starts.
  • -s: Runs self-tests on the disk. The documentation is needed for this one, but the setting shown runs a short test at 2AM every day and a long test at 4AM on the first of every month.
  • -m: Send email reports to all the users and email addresses listed.
  • -M: Modifies the behaviour of the -m directive. The test option causes smartd to send a test email for each monitored drive when the service starts. The exec option causes smartd to execute the given script rather than the built-in mail command.
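
The -s pattern is easier to reason about with a quick experiment. The sketch below checks a pattern against a timestamp written as T/MM/DD/d/HH (test type, month, day of month, day of week, hour), which is the form the smartd.conf man page describes. The function name `schedule_matches` is invented for this example, and using grep -x assumes that smartd requires the pattern to match the whole string.

```shell
#!/bin/sh
# Hypothetical helper: test a smartd "-s" schedule pattern against a
# timestamp written as T/MM/DD/d/HH. grep -E -x requires the extended
# regular expression to match the whole string.
schedule_matches() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

pattern='(S/../.././02|L/../01/./04)'
schedule_matches "$pattern" 'S/05/23/2/02' && echo "short test at 02:00 on 23 May"
schedule_matches "$pattern" 'L/06/01/4/04' && echo "long test at 04:00 on 1 June"
schedule_matches "$pattern" 'L/06/02/5/04' || echo "no long test on 2 June"
```

Each dot stands for one character of a field, which is why the month and hour take two dots and the day of the week takes one.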

Restart smartd

Restarting smartd is necessary to get it to pick up its new settings. This is done by asking systemctl to restart the service.

# systemctl restart smartd

The restart can take a moment in my experience. If you’ve copied the settings above exactly you should now get a flurry of emails, one for each disk in your system.

That’s it: you should now have a fully working SMART monitoring system. May your disks live long and trouble-free lives.

References

  • Smartctl man page
  • Smartd.conf man page
  • Smartd man page
  • Old but still relevant guide for Proxmox

I recommend always installing smartmontools on any server with physical hard drive(s). If you have a spinning hard drive (not an SSD) you will eventually have to replace it, because it will fail sooner or later.

I hate to be surprised when it is too late to replace a failing hard drive. smartmontools (SMART Monitoring Tools) queries your hard drive for its health status. If you do this daily and set up alerts to your email, you will most likely avoid a bad surprise in the future.

I highly recommend installing smartmontools monitoring and alerts on any server.

Here is how I have deployed it on EACH one of my servers:

THIS CAN BE INSTALLED ON ANY BARE METAL SERVER. FOR PROXMOX VE, THIS MEANS YOUR HARDWARE NODE.

1. install smartmontools

aptitude update && aptitude -y install smartmontools

2. edit default daemon start configuration:

nano /etc/default/smartmontools

uncomment all commented lines

enable_smart="/dev/sda /dev/sdb /dev/sdc"

start_smartd=yes

smartd_opts="--interval=28800"

3. edit smartd.conf  (in this example I have 3 SATA drives: sda, sdb, sdc)

nano /etc/smartd.conf

/dev/sda -d sat -a -s L/../../6/04 -m john@smith.com,jack@jill.com

/dev/sdb -d sat -a -s L/../../6/05 -m john@smith.com,jack@jill.com

/dev/sdc -d sat -a -s L/../../6/06 -m john@smith.com,jack@jill.com

The above example will do the following:

1. run a long self-test on sda at 4am Saturday

2. run a long self-test on sdb at 5am Saturday

3. run a long self-test on sdc at 6am Saturday

An email alert will be sent to john@smith.com and jack@jill.com if there is something wrong.

NOTE about the -s parameter:

The pattern has the form T/MM/DD/d/HH, and the hour is always written with two digits (04, not 4).

The second from the last field is the DAY of the week. As documented in man smartd.conf:

            Monday is day # 1

            Tuesday is day # 2

            …

            Saturday is day # 6

            Sunday is day # 7
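
If in doubt about the day number, it can be checked from the shell. The smartd.conf man page numbers the day-of-week field from 1 (Monday) to 7 (Sunday), which is the same numbering that `date +%u` prints. The sketch below assumes GNU date for the -d option, and the function name `smartd_weekday` is made up.

```shell
#!/bin/sh
# Print the smartd-style weekday number (1 = Monday ... 7 = Sunday) for a date.
# Assumes GNU date, whose -d option parses a YYYY-MM-DD string.
smartd_weekday() {
    date -d "$1" +%u
}

smartd_weekday 2023-05-20   # a Saturday, prints 6
```

So a long test meant for Saturday at 4am would use 6 in the day field and 04 in the hour field.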

4. restart smartmontools

/etc/init.d/smartmontools restart

5. check current HEALTH status:

smartctl -H /dev/sda
smartctl -H /dev/sdb
smartctl -H /dev/sdc
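
The three commands can also be wrapped in a loop that only speaks up when a disk does not report a passing status. This is a sketch under stated assumptions: the function name `health_ok` is invented, and the matched strings are the usual ATA ("test result: PASSED") and SCSI ("SMART Health Status: OK") wordings of smartctl -H.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -H` output on stdin and succeed only if
# an ATA-style "test result: PASSED" or SCSI-style "SMART Health Status: OK"
# line is present.
health_ok() {
    grep -Eq 'test result: PASSED|SMART Health Status: OK'
}

# Intended use (needs root and real disks, so it is left commented out):
#   for disk in /dev/sda /dev/sdb /dev/sdc; do
#       smartctl -H "$disk" | health_ok || echo "WARNING: $disk did not report a passing health status"
#   done
```

Treating anything other than an explicit pass as a warning errs on the side of caution, for example when smartctl cannot read the disk at all.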

DONE!
