Diagnosis and replacement of a defective hard drive on Linux Dedicated and Bare Metal Servers

If you receive a notification about a hard drive error or notice irregularities in the system, quick action is required to restore the redundancy of your RAID array. This article explains how to identify a defective hard drive on a Linux Dedicated Server or a Linux Bare Metal Server with software RAID and how to prepare the server for replacing the defective drive.

Note

This article assumes basic knowledge of server administration with Linux. If you have questions about replacing a defective hard drive or need support, please contact IONOS Customer Support. You can find the contact information on the following page: IONOS Customer Support

To ensure the highest possible reliability, it is necessary that you monitor your server's software RAID. If you receive a notification email about a defective hard drive or notice a hard drive defect yourself, you must identify the defective hard drive and prepare the server for replacing the defective drive. Then, contact IONOS Customer Support to initiate the hard drive replacement.

Please Note

RAID systems provide greater reliability and/or higher speed. However, they are no substitute for regular backups. To prevent data loss, we recommend creating regular backups. Furthermore, make sure to create a backup before executing the steps listed below to ensure the security of your data.

Further information on creating backups can be found in the following category of the IONOS Help Center: Backup Solutions

Check the status of the software RAID

Establish an SSH connection to your server and log in with your root account. Instructions for this can be found in the following articles:
Setting up an SSH connection to your Linux server from a Microsoft Windows computer
Setting up an SSH connection to your Linux server from a Linux computer
To check the status of the software RAID, enter the following command in the shell:
cat /proc/mdstat

Interpretation of the output

Intact RAID: A functioning RAID is indicated by the status [UU] (for RAID 1) or [UUUU] (for RAID 5/6 with 4 hard drives). Each "U" stands for "Up" (active).

Defective/Missing Device: A [_U] or [U_] indicates that a hard drive is missing or out of sync.

Faulty Marker: An (F) after a device (e.g., sdb1[2](F)) means that the system has already marked the drive as defective ("faulty") in software.

In configurations with 2 SSDs (operating system) and additional HDDs (data), you will see multiple md devices:

md1, md2, md127, or similar (often RAID 1 for boot/root)
md11 or similar (often RAID 5 or 6 for data)

Check all sections of the output for missing Us or (F) markers. Here is an example of an intact RAID:

[root@localhost ~]# cat /proc/mdstat

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 sda3[1] sdb3[0]
262016 blocks [2/2] [UU]

md1 : active raid1 sda2[1] sdb2[0]
119684160 blocks [2/2] [UU]

md0 : active raid1 sda1[1] sdb1[0]
102208 blocks [2/2] [UU]

unused devices: <none>

The example above shows three Multiple Devices or RAID arrays (md0, md1, md2). For each of these logical drives, it is specified which partitions it consists of and on which drives these partitions are located. The logical drive md0 consists of the partitions sda1 and sdb1. In the line below the respective logical drive, the state of the individual partitions is shown in square brackets at the end of the line.

In the following example, each logical drive contains only one partition, and all of these partitions are located on the hard drive sda. The corresponding partitions on the second hard drive sdb are not integrated. You can also recognise this by the entry [_U]. The missing partitions of the hard drive sdb indicate that there is an error or a defect with this hard drive.

[root@localhost ~]# cat /proc/mdstat

Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 sda1[1]
102208 blocks [2/1] [_U]

md1 : active raid1 sda2[1]
119684160 blocks [2/1] [_U]

md2 : active raid1 sda3[1]
262016 blocks [2/1] [_U]

unused devices: <none>

In the following example, a defective hard drive is still integrated into the RAID. This can be recognised by the (F) marker displayed after the partitions sdb3 and sdb1 in md3 and md1.

[root@localhost ~]# cat /proc/mdstat

Personalities : [raid1]
md3 : active raid1 sda3[0] sdb3[2](F)
439553856 blocks super 1.0 [2/1] [U_]
bitmap: 1/4 pages [4KB], 65536KB chunk

md1 : active raid1 sdb1[2](F) sda1[0]
19529600 blocks super 1.0 [2/1] [U_]

unused devices: <none>
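
The check described above (looking for missing Us or (F) markers in every section) can be sketched as a small shell function. This is a minimal sketch, not an official tool: the function name find_degraded_arrays and the output words "faulty" and "degraded" are chosen here for illustration, and it assumes the /proc/mdstat layout shown in the examples above.

```shell
# find_degraded_arrays [FILE]
# Prints one line per md array that shows an (F) marker or a missing member.
# Reads /proc/mdstat by default; a file can be passed instead for testing.
# The function name and output format are illustrative assumptions.
find_degraded_arrays() {
    awk '
        # Array line, e.g. "md3 : active raid1 sda3[0] sdb3[2](F)"
        /^md[0-9]+ :/ {
            name = $1
            if ($0 ~ /\(F\)/) faulty[name] = 1
            next
        }
        # Status line, e.g. "439553856 blocks super 1.0 [2/1] [U_]"
        name != "" && match($0, /\[[U_]+\]/) {
            status = substr($0, RSTART, RLENGTH)
            if ((name in faulty) || status ~ /_/)
                print name, status, ((name in faulty) ? "faulty" : "degraded")
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling find_degraded_arrays without arguments on the server prints nothing if every array shows only Us and no (F) marker.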

Diagnosis of hard drive errors

To detect hard drive errors, complete the following:

Install the smartctl program. Smartctl is a command-line utility used to monitor drives using SMART (Self-Monitoring, Analysis and Reporting Technology). You can use this program to check if a hard drive is defective. It is part of the Smartmontools. Smartmontools are available as packages for many Linux distributions. More information can be found on the following page: Smartmontools Packages

Note

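In some cases, a hard drive defect might not be detected using the SMART values. Therefore, we recommend additionally analysing the log file /var/log/messages. If this file does not exist on your system (for example, on many Debian and Ubuntu installations), check /var/log/syslog or the kernel messages shown by journalctl -k instead.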

Install smartctl

To install smartctl, log in to the server as an administrator.
Install the required packages depending on your distribution:
AlmaLinux 9 and 10, and Rocky Linux 9 and 10:
dnf install smartmontools
Debian and Ubuntu:
sudo apt-get install smartmontools

Retrieving hard drive information

To retrieve a list of hard drives, enter the following command:

smartctl --scan

Example:

[root@localhost ~]# smartctl --scan

/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
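
If the server has several drives, the device names from the scan output can be fed directly into a health check. The following is a minimal sketch under the assumption that the scan output has the format shown above (device path in the first column). The function names list_scanned_devices and check_all_drives are illustrative and not part of smartmontools.

```shell
# list_scanned_devices: reads "smartctl --scan" output on stdin and prints
# only the device paths (first column). Name is an illustrative assumption.
list_scanned_devices() {
    awk '$1 ~ /^\/dev\// { print $1 }'
}

# check_all_drives: runs the overall health check (smartctl -H) for every
# device reported by the scan. Must be run as root on the server.
check_all_drives() {
    smartctl --scan | list_scanned_devices | while read -r dev; do
        echo "== $dev =="
        smartctl -H "$dev"
    done
}
```

Running check_all_drives prints the health self-assessment for each drive in turn, so a drive that does not report PASSED is easy to spot.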

To retrieve detailed information for error diagnosis, enter the following command:

smartctl -iHAl error [HARD_DRIVE_NAME]

Note

Please note that the device interfaces must be specified in the following format:

SCSI / SATA devices:

smartctl -iHAl error /dev/sd[a-z]

Example:

[root@localhost ~]# smartctl -iHAl error /dev/sda

After entering the command, information similar to the following will be displayed:

[root@localhost ~]# smartctl -iHAl error /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-862.14.4.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HGST HUS722T1TALA604
Serial Number:    WMC6N0K2RW66
LU WWN Device Id: 5 0014ee 004722db0
Firmware Version: RAGNWA07
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in the smartctl database [for details, use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri 3 May 07:45:14 2019 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor-specific SMART attributes with thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED     WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always      -           0
  3 Spin_Up_Time            0x0027   183   183   021    Pre-fail  Always      -           3833
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always      -           9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always      -           0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always      -           0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always      -           2560
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always      -           0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always      -           0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always      -           9
 16 Unknown_Attribute       0x0022   000   200   000    Old_age   Always      -           26802171994
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always      -           0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always      -           4
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always      -           67
194 Temperature_Celsius     0x0022   116   111   000    Old_age   Always      -           31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always      -           0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always      -           0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline     -           0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always      -           0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline     -           0

SMART Error Log Version: 1
No Errors Logged

Interpretation of parameters and error diagnosis

Analyse the detailed information you retrieved using the smartctl -iHAl error [HARD_DRIVE_NAME] command.

The first section lists information you can use to identify the hard drive:

=== START OF INFORMATION SECTION ===
Device Model:     HGST HUS722T1TALA604
Serial Number:    WMC6N0K2RW66
LU WWN Device Id: 5 0014ee 004722db0
Firmware Version: RAGNWA07
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in the smartctl database [for details, use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri 3 May 07:45:14 2019 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

This section displays, amongst other things, the device model and the serial number of the tested drive.

The second section shows the overall health status that the hard drive reports via SMART. If the value displayed is not "PASSED" but, for example, "FAILED" or "UNKNOWN", you should arrange for the respective hard drive to be replaced as soon as possible.

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

The third section lists the SMART attributes. Next to each current normalised value (VALUE), the worst value measured so far (WORST) and the respective threshold (THRESH) are listed. Higher normalised values are better: if the current value or the worst value measured so far is less than or equal to the threshold, a SMART warning is displayed in the WHEN_FAILED column (e.g., FAILING_NOW).

SMART Attributes Data Structure revision number: 16
Vendor-specific SMART attributes with thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED     WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always      -           0
  3 Spin_Up_Time            0x0027   183   183   021    Pre-fail  Always      -           3833
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always      -           9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always      -           0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always      -           0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always      -           2560
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always      -           0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always      -           0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always      -           9
 16 Unknown_Attribute       0x0022   000   200   000    Old_age   Always      -           26802171994
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always      -           0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always      -           4
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always      -           67
194 Temperature_Celsius     0x0022   116   111   000    Old_age   Always      -           31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always      -           0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always      -           0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline     -           0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always      -           0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline     -           0
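
The comparison of VALUE and WORST against THRESH can also be done mechanically. The sketch below assumes the usual SMART convention that an attribute counts as failed when its normalised value is less than or equal to the threshold, and that a threshold of 000 means the attribute never fails. The function name failing_attributes and its output labels are illustrative assumptions, not smartctl features.

```shell
# failing_attributes: reads "smartctl -A" output on stdin and prints the
# attributes whose VALUE or WORST has reached the threshold.
# Assumed column order: ID# NAME FLAG VALUE WORST THRESH ...
failing_attributes() {
    awk '
        # Attribute rows start with a numeric ID and carry a hex flag
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ {
            value = $4 + 0; worst = $5 + 0; thresh = $6 + 0
            if (thresh == 0) next          # assumed: threshold 0 never fails
            if (value <= thresh)      print $2, "failing_now"
            else if (worst <= thresh) print $2, "failed_in_the_past"
        }
    '
}

# Possible usage on the server (as root):
#   smartctl -A /dev/sda | failing_attributes
```

An empty result means that no attribute has reached its threshold. As described in this article, a drive can still be close to failing in that case, so the raw values should be checked as well.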

The following parameters can indicate an impending hard drive failure before a SMART warning is triggered:

Reallocated_Sector_Ct: This parameter specifies the number of sectors that have already been reallocated due to read errors. If a sector can no longer be read, written to, or checked correctly, the controller automatically assigns it a spare sector from the hard drive's reserve area. The original faulty sector is permanently marked as defective and is no longer used.

Note

A value greater than zero is not necessarily a cause for concern as long as it remains stable over a long period. However, a critical indicator of an impending hard drive failure is a steadily growing number of reallocated sectors. To detect a defect early, you should log this value regularly and request a replacement immediately if there is a continuous increase.

Current_Pending_Sector: Indicates the number of unstable sectors waiting for remapping. If a sector cannot be read and written to correctly, it initially receives the status "Current Pending Sector". The sector is not reallocated in this state because the data on the sector is unknown. Only after several unsuccessful read or write attempts is a replacement sector assigned, and the faulty sector is permanently marked as unreadable. If this value is non-zero, a hard drive failure is often imminent.

Offline_Uncorrectable: Indicates the number of uncorrectable errors during read and write access to sectors.
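
To follow these three parameters over time, their raw values can be extracted from the attribute table and appended to a log. This is a minimal sketch: the function name critical_smart_counters and the log path in the usage comment are hypothetical, and it assumes that the raw value of these attributes is a plain number in the last column.

```shell
# critical_smart_counters: reads "smartctl -A" output on stdin and prints
# NAME=RAW_VALUE for the three parameters described above.
# Assumption: the raw value is the last field of the row.
critical_smart_counters() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            print $2 "=" $NF
        }
    '
}

# Possible usage on the server (as root); the log path is a hypothetical example:
#   ts=$(date -u '+%Y-%m-%dT%H:%M:%SZ')
#   smartctl -A /dev/sda | critical_smart_counters | sed "s/^/$ts sda /" >> /root/smart-trend.log
```

Comparing the logged lines from different days shows whether one of the counters is growing continuously.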

The last section shows the hard drive's internal error log. Errors are recorded here if the hard drive could not correctly process commands sent by the server. If an error count in the double digits or higher is displayed in this section, you should arrange for a hard drive replacement as soon as possible.

SMART Error Log Version: 1
No Errors Logged

Retrieve detailed information for hard drive replacement

To arrange for the defective hard drive to be replaced, the following information is required:

Identifier of the hard drive in the RAID (e.g., sda)
Serial number
Model
Log file (optional)
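
The model and serial number can be picked out of the smartctl information section with a short filter. This sketch assumes the "Device Model:" and "Serial Number:" labels shown in the SATA example output in this article; other drive types may use different labels. The function name drive_identity is an illustrative assumption.

```shell
# drive_identity: reads "smartctl -i" output on stdin and prints the model
# and serial number in KEY=VALUE form.
# Assumption: labels as in the SATA example ("Device Model:", "Serial Number:").
drive_identity() {
    awk -F': *' '
        /^Device Model:/  { print "Model=" $2 }
        /^Serial Number:/ { print "Serial=" $2 }
    '
}

# Possible usage on the server (as root):
#   smartctl -i /dev/sdb | drive_identity
```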

Create a SMART log

To generate a full SMART log, enter the following command:

smartctl -x [HARD_DRIVE_NAME]

Example:

[root@localhost ~]# smartctl -x /dev/sda

If the hard drive can no longer be accessed via smartctl, you can retrieve the required information using the hdparm program. To install hdparm:

AlmaLinux 9 and 10, and Rocky Linux 9 and 10

dnf -y install hdparm

Ubuntu/Debian

sudo apt-get update
sudo apt-get install hdparm

Then, enter the following command to retrieve the information required for the hard drive replacement:

hdparm -i /dev/sda

Notes

If the SMART log was created as described above, this is sufficient information. You can then arrange for the replacement of the defective hard drive. Please contact IONOS Customer Support to do this.
If you cannot retrieve the serial number of the defective hard drive using smartctl, you can alternatively provide IONOS Customer Support with the serial number(s) of the functioning hard drive(s).

Prepare the server for hard drive replacement

In the following example, it is assumed that the second hard drive (sdb) needs to be replaced. As part of the status check, the following software RAID status is displayed, for example:

[root@localhost ~]# cat /proc/mdstat

Personalities : [raid1]
md3 : active raid1 sda3[0] sdb3[2]
439553856 blocks super 1.0 [2/2] [UU]

md1 : active raid1 sdb1[2] sda1[0]
19529600 blocks super 1.0 [2/2] [UU]

unused devices: <none>

In this example, the second hard drive (sdb) is still integrated into the RAID and is therefore still in operation.

Manually mark the RAID device as “faulty” to remove it from the RAID

To remove the faulty hard drive from the RAID, mark it as “faulty”. To do this, enter the following command:

[root@localhost ~]# mdadm PATH_TO_RAID_ARRAY -f PATH_TO_HARD_DRIVE

In the examples below, the partitions sdb3 and sdb1 of the hard drive sdb are marked as faulty:

[root@localhost ~]# mdadm /dev/md3 -f /dev/sdb3
mdadm: set /dev/sdb3 faulty in /dev/md3

[root@localhost ~]# mdadm /dev/md1 -f /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md1

After entering the command, the RAID will have the following status (showing the (F) marker):

[root@localhost ~]# cat /proc/mdstat

Personalities : [raid1]
md3 : active raid1 sda3[0] sdb3[2](F)
439553856 blocks super 1.0 [2/1] [U_]

md1 : active raid1 sdb1[2](F) sda1[0]
19529600 blocks super 1.0 [2/1] [U_]

unused devices: <none>

Removing a partition from the RAID

To remove a partition from the RAID, enter the following command:

[root@localhost ~]# mdadm -r PATH_TO_RAID_ARRAY PATH_TO_PARTITION

In the examples below, the partitions sdb3 and sdb1 are removed from the RAIDs md3 and md1:

[root@localhost ~]# mdadm -r /dev/md3 /dev/sdb3
mdadm: hot removed /dev/sdb3 from /dev/md3

[root@localhost ~]# mdadm -r /dev/md1 /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md1

Next, check the status of the RAID. In this example, the RAID that has been prepared for the hard drive replacement has the following final status:

[root@localhost ~]# cat /proc/mdstat

Personalities : [raid1]
md3 : active raid1 sda3[0]
439553856 blocks super 1.0 [2/1] [U_]

md1 : active raid1 sda1[0]
19529600 blocks super 1.0 [2/1] [U_]

unused devices: <none>
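
If the defective drive has partitions in several arrays, it helps to list the required commands before running them. The following sketch only prints the mdadm commands in the form used in this article and does not execute anything. The function name plan_disk_removal is an illustrative assumption, and the sketch assumes sdX-style device names as in the examples above.

```shell
# plan_disk_removal DISK [FILE]
# Prints (does not run) the mdadm commands to mark every RAID member
# partition of DISK as faulty and remove it. Reads /proc/mdstat by default.
# Assumption: partitions are named DISK followed by a number (e.g. sdb3).
plan_disk_removal() {
    disk="$1"
    awk -v disk="$disk" '
        /^md[0-9]+ :/ {
            for (i = 3; i <= NF; i++) {
                part = $i
                sub(/\[.*/, "", part)      # strip "[2]" and a trailing "(F)"
                if (part ~ ("^" disk "[0-9]+$")) {
                    print "mdadm /dev/" $1 " -f /dev/" part
                    print "mdadm -r /dev/" $1 " /dev/" part
                }
            }
        }
    ' "${2:-/proc/mdstat}"
}

# Possible usage on the server:
#   plan_disk_removal sdb
```

Review the printed commands carefully before entering them, because marking the wrong partition as faulty removes the remaining redundancy of the array.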

Check used swap partitions

Check which swap partitions are being used by the operating system. To do this, enter the following command:

[root@localhost ~]# cat /proc/swaps

Filename    Type        Size      Used  Priority
/dev/sda2   partition   9765884   0     -1
/dev/sdb2   partition   9765884   0     -2

Alternatively, you can verify which swap partitions are defined in fstab by entering the following command:

[root@localhost ~]# grep swap /etc/fstab
/dev/sda2 none swap sw
/dev/sdb2 none swap sw

Deactivate the swap partition on the defective device

Deactivate the swap partition on the defective hard drive so that it can be safely replaced. To do this, enter the following command:

[root@localhost ~]# swapoff PATH_TO_PARTITION

Example:

[root@localhost ~]# swapoff /dev/sdb2

Note

If the swap partition on the faulty hard drive is not deactivated before the hard drive is replaced, the swap partition will be marked as "deleted" in /proc/swaps.
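
To find the active swap partitions on the drive that is going to be replaced, the contents of /proc/swaps can be filtered by drive name. This is a minimal sketch: the function name swap_on_disk is an illustrative assumption, and it assumes sdX-style partition names as in the example above.

```shell
# swap_on_disk DISK [FILE]
# Prints the active swap partitions located on DISK (e.g. sdb).
# Reads /proc/swaps by default; a file can be passed instead for testing.
swap_on_disk() {
    disk="$1"
    awk -v disk="$disk" '
        # Skip the header line, then match e.g. /dev/sdb2 for DISK=sdb
        NR > 1 && $1 ~ ("^/dev/" disk "[0-9]+$") { print $1 }
    ' "${2:-/proc/swaps}"
}

# Possible usage on the server (as root):
#   for part in $(swap_on_disk sdb); do swapoff "$part"; done
```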

Arrange for a hard drive replacement

You can now arrange for the faulty hard drive to be replaced. To do so, please contact IONOS Customer Support. You will find the contact details on the following page: IONOS Customer Support

Required steps after hard drive replacement

After the defective hard drive has been replaced, it is necessary to rebuild the software RAID. Further information on rebuilding a software RAID can be found in the following article: Rebuilding a software RAID (Linux)
