I encountered the following error in /var/log/messages:
Aug 15 03:55:42 hostname smartd: Device: /dev/sda, 1 Currently unreadable (pending) sectors
This caused the / partition to be mounted read-only. The server is still accessible, but you can't do much on it. Let's troubleshoot this.
The read-only file system shows up when I try to create a test file in the /root directory:
touch: cannot touch `/root/testfile': Read-only file system
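The same condition can be confirmed without writing anything, by looking at the mount options. A minimal sketch, assuming a Linux system that exposes /proc/mounts; the function name mount_mode is my own, not a standard tool:

```shell
# Print "ro" or "rw" for a mount point, reading /proc/mounts-style lines on stdin.
# Fields per line: device mountpoint fstype options dump pass
# If several lines match (e.g. rootfs and the real device), the last one wins.
mount_mode() {
    awk -v target="$1" '
        $2 == target {
            n = split($4, opts, ",")
            mode = "rw"
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") mode = "ro"
        }
        END {
            if (mode == "") exit 1   # mount point not found
            print mode
        }
    '
}
```

Usage: `mount_mode / < /proc/mounts` prints ro when the root partition has been remounted read-only.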
What is the SMART daemon (smartd)?
Self-Monitoring, Analysis and Reporting Technology (SMART) is a monitoring system built into many ATA-3 and later ATA, IDE and SCSI-3 hard drives. Its purpose is to monitor the reliability of the hard drive, predict drive failures, and carry out different types of drive self-tests. smartd is the daemon that polls this data and logs warnings such as the one above. We will use the smartctl command to find out what is wrong with the disk.
Let's check the overall health of disk /dev/sda:
$ smartctl -H /dev/sda
smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
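For scripting, the verdict line can be checked automatically. A minimal sketch that matches the ATA wording shown above; other device types may word the result differently, and the function name smart_health_passed is my own:

```shell
# Read `smartctl -H` output on stdin.
# Succeeds (exit 0) only if the overall-health line reports PASSED.
smart_health_passed() {
    grep -q 'self-assessment test result: PASSED'
}
```

Usage: `smartctl -H /dev/sda | smart_health_passed || echo "disk needs attention"`. As this post shows, a PASSED result alone does not prove the disk is healthy.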
It passed, but that is only a general summary. We need to dig deeper by reading the disk's self-test log and error log:
$ smartctl -q errorsonly -H -l selftest -l error /dev/sda
ATA Error Count: 2
Error 2 occurred at disk power-on lifetime: 36795 hours (1533 days + 3 hours)
Error 1 occurred at disk power-on lifetime: 31542 hours (1314 days + 6 hours)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       60%     39255         -
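On a disk with a long self-test history, it helps to filter the log down to the failed entries. A rough sketch, assuming the log rows start with "#" and a failed status contains the word "failure" as in the output above; the function name failed_selftests is my own:

```shell
# Read `smartctl -l selftest` output on stdin and print only the failed tests.
# Exit status is 0 if at least one failed test was found, 1 otherwise.
failed_selftests() {
    awk '
        /^#[ ]*[0-9]+/ && /failure/ { found++; print }
        END { exit (found > 0 ? 0 : 1) }
    '
}
```

Usage: `smartctl -l selftest /dev/sda | failed_selftests`.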
When I googled the error above, it seemed that the hard disk might have a hardware problem. fsck alone might not help much, since it only fixes logical errors in the file system, not hardware errors.
The errors above are recorded against the disk's power-on lifetime, an attribute explained as follows (reference):
Count of hours in power-on state. The raw value of this attribute shows total count of hours (or minutes, or seconds, depending on manufacturer) in power-on state. A decrease of this attribute value to the critical level (threshold) indicates a decrease of the MTBF (Mean Time Between Failures).
However, in reality, even if the MTBF value falls to zero, it does not mean that the MTBF resource is completely exhausted and the drive will not function normally.
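The hour counts in the smartctl output convert to days with simple integer division. A small sketch of that arithmetic, assuming the raw value is in hours as it is on this disk; the function name hours_to_days is my own:

```shell
# Convert a power-on hour count into "D days + H hours".
hours_to_days() {
    hours=$1
    echo "$((hours / 24)) days + $((hours % 24)) hours"
}

# hours_to_days 36795   -> 1533 days + 3 hours  (matches Error 2 above)
# hours_to_days 31542   -> 1314 days + 6 hours  (matches Error 1 above)
# hours_to_days 39255   -> 1635 days + 15 hours (the failed self-test)
```

So the failed self-test ran when the disk had been powered on for roughly four and a half years.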
Since the disk is mounted read-only, we had better take a backup before proceeding with any repair. In this case, copying with SCP to another server is a good idea because we cannot write to the local disk at the moment. For me, the "home" partition holds the most important data to save.
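A minimal sketch of such a backup over SCP. The host backup.example.com and the /backup/ path are hypothetical placeholders, not the server I actually used, and the function backup_dir is my own wrapper. It only prints the command unless DRY_RUN=0 is set, so the destination can be reviewed first:

```shell
# Copy a directory to a remote server with scp.
#   -r  copy directories recursively
#   -p  preserve modification times and modes
# By default (DRY_RUN=1) the command is only printed, not executed.
backup_dir() {
    src=$1
    dest=$2
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "scp -rp $src $dest"
    else
        scp -rp "$src" "$dest"
    fi
}

# Hypothetical destination; replace with a real server and path:
# backup_dir /home root@backup.example.com:/backup/
```

Run it once as shown to check the command, then again with `DRY_RUN=0` in front to perform the copy.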
I tried remounting the partition again as in step 1, but the same error occurred, so I proceeded to the next step.
3. Run a full file system check using fsck from a rescue environment:
$ fsck -f -y /dev/sda2
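The exit status of fsck tells you what happened during the check. It is a sum of the bit values documented in the fsck(8) man page, which the sketch below decodes; the function name describe_fsck_status is my own:

```shell
# Print one line for each condition encoded in an fsck exit status.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    [ $((status & 1)) -ne 0 ]   && echo "file system errors corrected"
    [ $((status & 2)) -ne 0 ]   && echo "system should be rebooted"
    [ $((status & 4)) -ne 0 ]   && echo "file system errors left uncorrected"
    [ $((status & 8)) -ne 0 ]   && echo "operational error"
    [ $((status & 16)) -ne 0 ]  && echo "usage or syntax error"
    [ $((status & 32)) -ne 0 ]  && echo "checking canceled by user request"
    [ $((status & 128)) -ne 0 ] && echo "shared-library error"
    return 0
}
```

Usage: run `describe_fsck_status $?` immediately after the fsck command. A status with bit 4 set means fsck could not repair everything, which on a disk with pending sectors points back at the hardware.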
Even though the box remounted correctly after that, the smartd status was still haunting me. This forced me to make a final decision as my next step.
4. To avoid a sudden breakdown (the disk has already been running for more than 1,000 days), I decided to replace the hard disk and reinstall the box. It's better for me to do this as a planned maintenance task, so I won't have to worry about 'urgent' maintenance when it breaks down during the weekend or while I'm asleep!