Diagnosis and replacement of a defective hard drive on Linux Dedicated and Bare Metal Servers
If you receive a notification about a hard drive error or notice irregularities in the system, quick action is required to restore the redundancy of your RAID array. This article explains how to identify a defective hard drive on a Linux Dedicated Server or a Linux Bare Metal Server with software RAID and how to prepare the server for replacing the defective drive.
Note
This article assumes basic knowledge of server administration with Linux. If you have questions about replacing a defective hard drive or need support, please contact IONOS Customer Support. You can find the contact information on the following page: IONOS Customer Support
To ensure the highest possible reliability, it is necessary that you monitor your server's software RAID. If you receive a notification email about a defective hard drive or notice a hard drive defect yourself, you must identify the defective hard drive and prepare the server for replacing the defective drive. Then, contact IONOS Customer Support to initiate the hard drive replacement.
Please Note
RAID systems provide greater reliability and/or higher speed. However, they are no substitute for regular backups. To prevent data loss, we recommend creating regular backups. Furthermore, make sure to create a backup before executing the steps listed below to ensure the security of your data.
Further information on creating backups can be found in the following category of the IONOS Help Center: Backup Solutions
Check the status of the software RAID
Establish an SSH connection to your server and log in with your root account. Instructions for this can be found in the following articles:
Setting up an SSH connection to your Linux server from a Microsoft Windows computer
Setting up an SSH connection to your Linux server from a Linux computer
To check the status of the software RAID, enter the following command in the shell:
cat /proc/mdstat
Interpretation of the output
Intact RAID: A functioning RAID is indicated by the status [UU] (for RAID 1) or [UUUU] (for RAID 5/6 with 4 hard drives). Each "U" stands for "Up" (active).
Defective/Missing Device: A [_U] or [U_] indicates that a hard drive is missing or out of sync.
Faulty Marker: An (F) after a device (e.g., sdb1[2](F)) means that the system has already marked the drive as defective ("faulty") in software.
In configurations with 2 SSDs (operating system) and additional HDDs (data), you will see multiple md devices:
- md1, md2, md127, or similar (often RAID 1 for boot/root)
- md11 or similar (often RAID 5 or 6 for data)
Check all sections of the output for missing Us or (F) markers. Here is an example of an intact RAID:
[root@localhost ~]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 sda3[1] sdb3[0]
262016 blocks [2/2] [UU]
md1 : active raid1 sda2[1] sdb2[0]
119684160 blocks [2/2] [UU]
md0 : active raid1 sda1[1] sdb1[0]
102208 blocks [2/2] [UU]
unused devices: <none>
The example above shows three Multiple Devices or RAID arrays (md0, md1, md2). For each of these logical drives, it is specified which partitions it consists of and on which drives these partitions are located. The logical drive md0 consists of the partitions sda1 and sdb1. In the line below the respective logical drive, the state of the individual partitions is shown in square brackets at the end of the line.
In the following example, each logical drive contains only one partition, which is located on the hard drive sda. The corresponding partitions on the second hard drive sdb are no longer part of the arrays. You can also recognise this by the entry [_U]. The missing sdb partitions indicate that there is an error or a defect with this hard drive.
[root@localhost ~]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 sda1[1]
102208 blocks [2/1] [_U]
md1 : active raid1 sda2[1]
119684160 blocks [2/1] [_U]
md2 : active raid1 sda3[1]
262016 blocks [2/1] [_U]
unused devices: <none>
In the following example, a defective hard drive is still integrated into the RAID. This can be recognised by the (F) marker displayed after the sdb partitions in md3 and md1.
[root@localhost ~]# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda3[0] sdb3[2](F)
439553856 blocks super 1.0 [2/1] [U_]
bitmap: 1/4 pages [4KB], 65536KB chunk
md1 : active raid1 sdb1[2](F) sda1[0]
19529600 blocks super 1.0 [2/1] [U_]
unused devices: <none>
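The checks described above can be sketched as a small shell function. This is a minimal sketch, not an official tool: the function name mdstat_report is hypothetical, and it assumes array names of the form mdN and the two-line layout shown in the examples.

```shell
# mdstat_report [FILE]: print one line per md array with its member status.
# FILE defaults to /proc/mdstat; a saved copy can be passed for offline checks.
# Hypothetical helper; assumes array names of the form "mdN".
mdstat_report() {
    awk '
        # Array header, e.g. "md1 : active raid1 sdb1[2](F) sda1[0]"
        $1 ~ /^md[0-9]+$/ && $2 == ":" {
            name = $1
            faulty = ""
            for (i = 3; i <= NF; i++)
                if ($i ~ /\(F\)/) {
                    dev = $i
                    sub(/\[.*/, "", dev)      # "sdb1[2](F)" -> "sdb1"
                    faulty = faulty " " dev
                }
            next
        }
        # Status line, e.g. "19529600 blocks super 1.0 [2/1] [U_]"
        name != "" && $2 == "blocks" && $NF ~ /^\[[U_]+\]$/ {
            state = ($NF ~ /_/ || faulty != "") ? "DEGRADED" : "OK"
            printf "%s %s %s", name, state, $NF
            if (faulty != "") printf " faulty:%s", faulty
            printf "\n"
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Called without arguments on the server, mdstat_report reads /proc/mdstat directly. Any line reported as DEGRADED corresponds to a missing U or an (F) marker in the raw output.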
Diagnosis of hard drive errors
To detect hard drive errors, complete the following:
Install the smartctl program. Smartctl is a command-line utility used to monitor drives using SMART (Self-Monitoring, Analysis and Reporting Technology). You can use this program to check if a hard drive is defective. It is part of the Smartmontools. Smartmontools are available as packages for many Linux distributions. More information can be found on the following page: Smartmontools Packages
Note
In some cases, a hard drive defect might not be detected using the SMART values. Therefore, we recommend additionally analysing the system log, for example /var/log/messages (on Debian and Ubuntu, typically /var/log/syslog).
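One way to analyse the log file is to filter it for kernel I/O errors that name a specific drive. The following is a rough sketch: the function name disk_errors is hypothetical, and the search pattern is an assumption based on common kernel messages such as "I/O error, dev sdb, sector ..." and "Buffer I/O error on dev sdb1"; other error messages will not be matched.

```shell
# disk_errors DISK [LOGFILE]: print log lines reporting I/O errors for DISK
# (e.g. "sdb") or one of its partitions. LOGFILE defaults to /var/log/messages.
# Hypothetical helper; the pattern is an assumption, not an exhaustive list.
disk_errors() {
    grep -E "I/O error.*dev ${1}[0-9]*[, ]" "${2:-/var/log/messages}"
}
```

For example, disk_errors sdb lists matching entries for the second drive. An empty result does not prove that the drive is healthy.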
Install smartctl
To install smartctl, log in to the server as an administrator.
Install the required packages depending on your distribution:
AlmaLinux 9 and 10, and Rocky Linux 9 and 10:
dnf install smartmontools
Debian and Ubuntu:
sudo apt-get install smartmontools
Retrieving hard drive information
To retrieve a list of hard drives, enter the following command:
smartctl --scan
Example:
[root@localhost ~]# smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
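To get a quick health verdict for every drive found by the scan, the two steps can be combined. This is a minimal sketch with hypothetical function names (scan_devices, health_of, check_all); it assumes ATA/SATA drives, for which smartctl -H prints the "overall-health self-assessment test result" line.

```shell
# scan_devices: read `smartctl --scan` output on stdin, print the device paths.
scan_devices() {
    awk '$1 ~ /^\/dev\// { print $1 }'
}

# health_of: read `smartctl -H` output on stdin, print the verdict (e.g. PASSED).
# Prints nothing if the expected line is absent (e.g. for non-ATA devices).
health_of() {
    awk -F': *' '/overall-health self-assessment test result/ { print $2 }'
}

# check_all: print "DEVICE: VERDICT" for each scanned device.
# Requires root privileges and an installed smartmontools package.
check_all() {
    smartctl --scan | scan_devices | while read -r dev; do
        printf '%s: %s\n' "$dev" "$(smartctl -H "$dev" | health_of)"
    done
}
```

Running check_all as root prints one line per drive. An empty verdict means the health line was not found and the full smartctl output should be inspected manually.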
To retrieve detailed information for error diagnosis, enter the following command:
smartctl -iHAl error [HARD_DRIVE_NAME]
Note
Please note that the device interfaces must be specified in the following format:
SCSI / SATA devices:
smartctl -iHAl error /dev/sd[a-z]
Example:
[root@localhost ~]# smartctl -iHAl error /dev/sda
After entering the command, information similar to the following will be displayed:
[root@localhost ~]# smartctl -iHAl error /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-862.14.4.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: HGST HUS722T1TALA604
Serial Number: WMC6N0K2RW66
LU WWN Device Id: 5 0014ee 004722db0
Firmware Version: RAGNWA07
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in the smartctl database [for details, use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri 3 May 07:45:14 2019 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART Attributes Data Structure revision number: 16
Vendor-specific SMART attributes with thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   183   183   021    Pre-fail  Always       -       3833
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2560
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
 16 Unknown_Attribute       0x0022   000   200   000    Old_age   Always       -       26802171994
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       67
194 Temperature_Celsius     0x0022   116   111   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
SMART Error Log Version: 1
No Errors Logged
Interpretation of parameters and error diagnosis
Analyse the detailed information you retrieved using the smartctl -iHAl error [HARD_DRIVE_NAME] command.
The first section lists information you can use to identify the hard drive:
=== START OF INFORMATION SECTION ===
Device Model: HGST HUS722T1TALA604
Serial Number: WMC6N0K2RW66
LU WWN Device Id: 5 0014ee 004722db0
Firmware Version: RAGNWA07
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in the smartctl database [for details, use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri 3 May 07:45:14 2019 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
This section displays, amongst other things, the device model and the serial number of the tested drive.
The second section shows the overall health status that the hard drive reports via SMART. If the value displayed is not "PASSED" but, for example, "FAILED" or "UNKNOWN", you should arrange for the respective hard drive to be replaced as soon as possible.
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
The third section lists the measured SMART values. Next to each current normalised value (VALUE), the worst value measured so far (WORST) and the respective threshold (THRESH) are listed. Higher normalised values are better: if the current value falls to or below the threshold, a SMART warning is displayed in the WHEN_FAILED column (FAILING_NOW). If only the worst measured value has reached the threshold, the column shows In_the_past.
SMART Attributes Data Structure revision number: 16
Vendor-specific SMART attributes with thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   183   183   021    Pre-fail  Always       -       3833
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2560
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
 16 Unknown_Attribute       0x0022   000   200   000    Old_age   Always       -       26802171994
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       67
194 Temperature_Celsius     0x0022   116   111   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
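The comparison of VALUE and WORST against THRESH can also be done mechanically. The following sketch uses a hypothetical function name, failing_attributes, and reads the attribute table from standard input. It assumes that a threshold of 000 means "always passing" and therefore skips those rows.

```shell
# failing_attributes: read `smartctl -A` output on stdin and print every
# attribute whose normalised VALUE or WORST is at or below THRESH.
# Hypothetical helper; columns are taken from the table header:
# $2 = ATTRIBUTE_NAME, $4 = VALUE, $5 = WORST, $6 = THRESH.
failing_attributes() {
    awk '
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ {
            value = $4 + 0; worst = $5 + 0; thresh = $6 + 0
            if (thresh == 0) next        # assumption: 000 never triggers
            if (value <= thresh)      print $2, "FAILING_NOW"
            else if (worst <= thresh) print $2, "In_the_past"
        }
    '
}
```

On the server, this could be used as smartctl -A /dev/sda | failing_attributes. For the example table above, it prints nothing because no attribute has reached its threshold.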
The following parameters can indicate an impending hard drive failure before a SMART warning is triggered:
Reallocated_Sector_Ct: This parameter specifies the number of sectors that have already been reallocated due to read errors. If a sector can no longer be read, written to, or checked correctly, the controller automatically assigns it a spare sector from the hard drive's reserve area. The original faulty sector is permanently marked as defective and is no longer used.
Note
A value greater than zero is not necessarily a cause for concern as long as it remains stable over a long period. However, a critical indicator of an impending hard drive failure is a steadily growing number of reallocated sectors. To detect a defect early, you should log this value regularly and request a replacement immediately if there is a continuous increase.
Current_Pending_Sector: Indicates the number of unstable sectors waiting to be remapped. If a sector cannot be read or written to correctly, it is initially counted as a "Current Pending Sector". The sector is not reallocated in this state because the data on the sector is unknown. Only after several unsuccessful read or write attempts is a replacement sector assigned, and the faulty sector is permanently marked as unreadable. If this value is non-zero, a hard drive failure is often imminent.
Offline_Uncorrectable: Indicates the number of uncorrectable errors during read and write access to sectors.
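To log these three values regularly, their raw values can be extracted and appended to a file with a timestamp. This is a minimal sketch: the function names sector_counters and log_sector_counters are hypothetical, and it assumes the raw value is the last column, as in the example output.

```shell
# sector_counters: read `smartctl -A` output on stdin and print the raw values
# of attributes 5, 197 and 198 on a single line. Missing attributes stay empty.
sector_counters() {
    awk '
        $3 ~ /^0x/ && $1 == 5   { realloc = $NF }
        $3 ~ /^0x/ && $1 == 197 { pending = $NF }
        $3 ~ /^0x/ && $1 == 198 { offline = $NF }
        END {
            printf "reallocated=%s pending=%s offline_uncorrectable=%s\n",
                   realloc, pending, offline
        }
    '
}

# log_sector_counters DEVICE LOGFILE: append one timestamped sample.
# Requires root privileges and an installed smartmontools package.
log_sector_counters() {
    printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" \
        "$(smartctl -A "$1" | sector_counters)" >> "$2"
}
```

Calling log_sector_counters /dev/sda with a log file of your choice, for example from a daily cron job, produces a history in which a steadily increasing count is easy to spot.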
The last section shows the hard drive's internal error log. Errors are recorded here when the hard drive could not correctly process commands sent by the server. If this section shows an error count in the double digits or higher, you should arrange for a hard drive replacement as soon as possible.
SMART Error Log Version: 1
No Errors Logged
Retrieve detailed information for hard drive replacement
To arrange for the defective hard drive to be replaced, the following information is required:
Identifier of the hard drive in the RAID (e.g., sda)
Serial number
Model
Log file (optional)
Create a SMART log
To generate a full SMART log, enter the following command:
smartctl -x [HARD_DRIVE_NAME]
Example:
[root@localhost ~]# smartctl -x /dev/sda
If the hard drive can no longer be accessed via smartctl, you can retrieve the required information using the hdparm program. To install hdparm:
AlmaLinux 9 and 10, and Rocky Linux 9 and 10
dnf -y install hdparm
Ubuntu/Debian
sudo apt-get update
sudo apt-get install hdparm
Then, enter the following command to retrieve the information required for the hard drive replacement:
hdparm -i /dev/sda
Notes
If the SMART log was created as described above, this is sufficient information. You can then arrange for the replacement of the defective hard drive. Please contact IONOS Customer Support to do this.
If you cannot retrieve the serial number of the defective hard drive using smartctl, you can alternatively provide IONOS Customer Support with the serial number(s) of the functioning hard drive(s).
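The model and serial number needed for the replacement request can be pulled out of the smartctl information section with a short filter. This is a sketch under the assumption that the drive reports the "Device Model" and "Serial Number" labels shown in the example output; other device types may use different labels. The function name drive_identity is hypothetical.

```shell
# drive_identity: read `smartctl -i` output on stdin, print model and serial.
# Hypothetical helper; fields that are not found are left empty.
drive_identity() {
    awk -F': *' '
        $1 == "Device Model"  { model = $2 }
        $1 == "Serial Number" { serial = $2 }
        END { printf "model=%s serial=%s\n", model, serial }
    '
}
```

For example, smartctl -i /dev/sda | drive_identity prints both values on one line, ready to be included in the support request together with the drive identifier.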
Prepare the server for hard drive replacement
In the following example, it is assumed that the second hard drive (sdb) needs to be replaced. As part of the status check, the following software RAID status is displayed, for example:
[root@localhost ~]# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda3[0] sdb3[2]
439553856 blocks super 1.0 [2/2] [UU]
md1 : active raid1 sdb1[2] sda1[0]
19529600 blocks super 1.0 [2/2] [UU]
unused devices: <none>
In this example, the second hard drive (sdb) is still integrated into the RAID and is therefore still in operation.
Manually mark the RAID device as “faulty” to remove it from the RAID
To remove the faulty hard drive from the RAID, mark it as “faulty”. To do this, enter the following command:
[root@localhost ~]# mdadm PATH_TO_RAID_ARRAY -f PATH_TO_PARTITION
In the examples below, the partitions sdb3 and sdb1 are marked as faulty:
[root@localhost ~]# mdadm /dev/md3 -f /dev/sdb3
mdadm: set /dev/sdb3 faulty in /dev/md3
[root@localhost ~]# mdadm /dev/md1 -f /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md1
After entering the command, the RAID will have the following status (showing the (F) marker):
[root@localhost ~]# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda3[0] sdb3[2](F)
439553856 blocks super 1.0 [2/1] [U_]
md1 : active raid1 sdb1[2](F) sda1[0]
19529600 blocks super 1.0 [2/1] [U_]
unused devices: <none>
Removing a partition from the RAID
To remove a partition from the RAID, enter the following command:
[root@localhost ~]# mdadm -r PATH_TO_RAID_ARRAY PATH_TO_PARTITION
In the examples below, the partitions sdb3 and sdb1 are removed from the RAIDs md3 and md1:
[root@localhost ~]# mdadm -r /dev/md3 /dev/sdb3
mdadm: hot removed /dev/sdb3 from /dev/md3
[root@localhost ~]# mdadm -r /dev/md1 /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md1
Next, check the status of the RAID. In this example, the RAID that has been prepared for the hard disk replacement has the following final status:
[root@localhost ~]# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda3[0]
439553856 blocks super 1.0 [2/1] [U_]
md1 : active raid1 sda1[0]
19529600 blocks super 1.0 [2/1] [U_]
unused devices: <none>
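On servers with several arrays, it can help to list the required mdadm commands for a drive before running them. The following sketch only prints the commands and changes nothing. The function name md_removal_commands is hypothetical, and it assumes array names of the form mdN and partition names of the form DISK followed by a number (such as sdb3).

```shell
# md_removal_commands DISK [FILE]: print the fail/remove commands for every
# partition of DISK (e.g. "sdb") that is listed as an array member.
# FILE defaults to /proc/mdstat. Dry run only: nothing is executed.
md_removal_commands() {
    awk -v disk="$1" '
        $1 ~ /^md[0-9]+$/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                part = $i
                sub(/\[.*/, "", part)             # "sdb3[2](F)" -> "sdb3"
                if (part ~ ("^" disk "[0-9]+$")) {
                    if ($i !~ /\(F\)/)            # not yet marked faulty
                        printf "mdadm /dev/%s -f /dev/%s\n", $1, part
                    printf "mdadm -r /dev/%s /dev/%s\n", $1, part
                }
            }
        }
    ' "${2:-/proc/mdstat}"
}
```

Review the printed commands carefully before executing them, in particular that they name the defective drive and not the remaining healthy one.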
Check used swap partitions
Check which swap partitions are being used by the operating system. To do this, enter the following command:
[root@localhost ~]# cat /proc/swaps
Filename Type Size Used Priority
/dev/sda2 partition 9765884 0 -1
/dev/sdb2 partition 9765884 0 -2
Alternatively, you can verify which swap partitions are defined in fstab by entering the following command:
[root@localhost ~]# grep swap /etc/fstab
/dev/sda2 none swap sw
/dev/sdb2 none swap sw
Deactivate the swap partition on the defective device
Deactivate the swap partition on the defective hard drive so that it can be safely replaced. To do this, enter the following command:
[root@localhost ~]# swapoff PATH_TO_PARTITION
Example:
[root@localhost ~]# swapoff /dev/sdb2
Note
If the swap partition on the faulty hard drive is not deactivated before the hard drive is replaced, the swap partition will be marked as ‘deleted’ in /proc/swaps.
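To find the active swap partitions on the drive that is to be replaced, the contents of /proc/swaps can be filtered by drive name. This is a minimal sketch with a hypothetical function name, swap_on_disk, which assumes partition names of the form DISK followed by a number.

```shell
# swap_on_disk DISK [FILE]: print the active swap partitions located on DISK
# (e.g. "sdb"). FILE defaults to /proc/swaps.
swap_on_disk() {
    awk -v disk="$1" '
        NR > 1 && $2 == "partition" && $1 ~ ("^/dev/" disk "[0-9]+$") { print $1 }
    ' "${2:-/proc/swaps}"
}
```

Each partition printed by swap_on_disk sdb can then be deactivated with swapoff as described above.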
Arrange for a hard drive replacement
You can now arrange for the faulty hard drive to be replaced. To do so, please contact IONOS Customer Support. You will find the contact details on the following page: IONOS Customer Support
Required steps after hard drive replacement
After the defective hard drive has been replaced, it is necessary to rebuild the software RAID. Further information on rebuilding a software RAID can be found in the following article: Rebuilding a software RAID (Linux)