I am encountering the following error in /var/log/messages:
This causes the / partition to be mounted read-only. The server is still accessible, but you can't do much on it. Let's troubleshoot this.
Collecting Information/Troubleshooting
Creating a test file in the /root directory fails because the file system is mounted read-only:
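To confirm the read-only state without relying on a failed write, the mount options of / can be read from /proc/mounts. This is a minimal sketch; the helper names mount_opts and is_readonly are hypothetical, and the optional second argument exists only so the helper can be pointed at another mounts table.

```shell
#!/bin/sh
# Print the mount options of a mount point, read from a mounts table
# (defaults to /proc/mounts). If the mount point is listed more than
# once, the last entry wins.
mount_opts() {
    awk -v mp="$1" '$2 == mp { opts = $4 } END { print opts }' "${2:-/proc/mounts}"
}

# Succeed if the comma-separated option list contains the "ro" flag.
# Options such as "errors=remount-ro" do not count as "ro".
is_readonly() {
    case ",$1," in
        *,ro,*) return 0 ;;
        *)      return 1 ;;
    esac
}

if is_readonly "$(mount_opts /)"; then
    echo "/ is mounted read-only"
else
    echo "/ is mounted read-write"
fi
```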
What is SMART daemon (smartd)?
Self-Monitoring, Analysis and Reporting Technology (SMART) is a monitoring system built into many ATA-3 and later ATA, IDE and SCSI-3 hard drives. Its purpose is to monitor the reliability of the hard drive, predict drive failures, and carry out different types of drive self-tests. smartd is the daemon from the smartmontools package that polls this data periodically and logs any problems it finds. We will use the smartctl command to help us find out what is wrong with the disk.
Let's check the overall health of disk /dev/sda:
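A health check along these lines can be scripted with smartctl -H. The sketch below assumes the disk is /dev/sda as in the text and that smartctl is run as root; the helper name health_ok is hypothetical. It only runs smartctl when both the tool and the device are present.

```shell
#!/bin/sh
# Assumption: the disk under test is /dev/sda; override with DISK=...
DISK="${DISK:-/dev/sda}"

# Read `smartctl -H` output on stdin and succeed if the drive reports
# itself healthy. ATA drives print "... test result: PASSED", SCSI
# drives print "SMART Health Status: OK".
health_ok() {
    grep -Eq 'test result: PASSED|SMART Health Status: OK'
}

if command -v smartctl >/dev/null 2>&1 && [ -b "$DISK" ]; then
    if smartctl -H "$DISK" | health_ok; then
        echo "$DISK: overall health PASSED"
    else
        echo "$DISK: overall health NOT OK (or smartctl could not read the drive)"
    fi
fi
```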
It passed, but that is only a general assessment. We need to dig deeper by running a self-test on the disk:
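A self-test of this kind is started with smartctl -t and its result is read back from the self-test log afterwards. The following is a sketch only, not necessarily the exact commands used here; the function names are hypothetical and the device is passed as an argument.

```shell
#!/bin/sh
# Start a short self-test on the given disk. The test runs inside the
# drive in the background; smartctl prints the estimated completion time.
start_short_test() {
    smartctl -t short "$1"
}

# Show the results of past self-tests once the test has finished.
show_selftest_log() {
    smartctl -l selftest "$1"
}

# Show the errors the drive itself has logged.
show_error_log() {
    smartctl -l error "$1"
}
```

Run as root, for example start_short_test /dev/sda, wait for the time smartctl estimates, then call show_selftest_log /dev/sda. A more thorough test can be requested by using -t long instead of -t short.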
When I searched Google for the error above, it seemed that the hard disk might have a hardware problem. FSCK alone might not help much, since it only fixes logical errors in the file system, not hardware errors.
The errors reported by smartd relate to the power-on lifetime attribute, which is explained as follows (reference):
Count of hours in power-on state. The raw value of this attribute shows total count of hours (or minutes, or seconds, depending on manufacturer) in power-on state. A decrease of this attribute value to the critical level (threshold) indicates a decrease of the MTBF (Mean Time Between Failures).
However, in reality, even if the MTBF value falls to zero, it does not mean that the MTBF resource is completely exhausted and the drive will not function normally.
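To put the power-on attribute into perspective, its raw value can be read with smartctl -A and converted to days. This sketch assumes the attribute has ID 9 (Power_On_Hours) and that the raw value is in hours, which, as noted above, depends on the manufacturer. The function names are hypothetical.

```shell
#!/bin/sh
# Print the raw value of SMART attribute 9 (Power_On_Hours) for a disk.
# "+ 0" keeps only the leading number, since some drives report values
# such as "12345h+32m+10.123s".
power_on_hours() {
    smartctl -A "$1" | awk '$1 == 9 { print $10 + 0 }'
}

# Convert a number of hours to whole days (integer division).
hours_to_days() {
    echo $(( $1 / 24 ))
}
```

For example, hours_to_days 24000 prints 1000. As root, hours_to_days "$(power_on_hours /dev/sda)" gives the age of the disk in days.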
Backup
Since the hard disk is in read-only mode, we had better take a backup before proceeding with any fix. In this case, copying the data to another server with SCP is a good idea, because we cannot write to the local disk at the moment. For me, the “home” partition is the most important data to save:
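A backup of this kind might look like the sketch below. The destination user@backup.example.com:/backup/ is a placeholder, not the server actually used here, and backup_home is a hypothetical helper name.

```shell
#!/bin/sh
# Assumption: placeholder destination; set BACKUP_DEST to a real
# user@host:/path before running.
BACKUP_DEST="${BACKUP_DEST:-user@backup.example.com:/backup/}"

# Copy /home to the remote server.
#   -r  copy directories recursively
#   -p  preserve modification times and modes
backup_home() {
    scp -rp /home "$BACKUP_DEST"
}
```

Calling backup_home starts the copy. Only the remote side is written to, so the read-only local disk is not a problem for the transfer itself.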
Problem Solving Process
1. Remount the / partition:
2. Run the e2fsck command to check the ext3 file system online:
I tried remounting the partition again as in step 1, but the same error occurred, so I proceeded to the next step.
3. Run a full file system check using FSCK from the rescue environment:
Even though the box remounted correctly after that, the smartd status was still bothering me. This forced me to make a final decision as my next step.
4. To avoid a sudden breakdown (the disk has already been running for more than 1000 days), I decided to replace the hard disk and reinstall the box. It is better for me to do this as planned maintenance, so I do not have to worry about ‘urgent’ maintenance when it breaks down over a weekend or while I am asleep!
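Steps 1 to 3 above can be sketched roughly as follows. This is an illustration under assumptions, not a record of the exact commands used: the root file system is assumed to live on /dev/sda1 (a hypothetical device name), and the function names are made up for this example.

```shell
#!/bin/sh
# Assumption: the root file system is on /dev/sda1; override with ROOT_DEV=...
ROOT_DEV="${ROOT_DEV:-/dev/sda1}"

# Step 1: try to remount / read-write.
remount_rw() {
    mount -o remount,rw /
}

# Step 2: check the file system while it is still mounted.
#   -f  force a check even if the file system is marked clean
#   -n  open read-only and answer "no" to every question, so nothing
#       is modified (results on a mounted file system can be unreliable)
check_online() {
    e2fsck -fn "$ROOT_DEV"
}

# Step 3: full check and repair, to be run only from the rescue
# environment with the file system unmounted.
#   -f  force a check, -y  answer "yes" to every repair question
check_rescue() {
    fsck -f -y "$ROOT_DEV"
}
```

All three functions need root. check_rescue changes the file system, so it should only be called after the backup has completed.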
REFERENCES
http://blog.secaserver.com/2011/08/smartd-error-1-unreadable-pending-sectors/