Are you responsible for one or more servers? Perhaps you have a computer at home that you worry about at night: "What happens if my hard drive fails?" If this is you, then you need SmartMonTools. It comes pre-installed on most flavours of Linux these days but, amazingly enough, it is not set to run automatically.
SmartMonTools will monitor your Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) enabled hard drives for potential problems that can show up before a hard drive fails completely. If properly set up, it will warn you of these potential issues and possibly save your data. Of course, you already have a proper backup system in case just such a disaster should occur.
I am assuming that SmartMonTools is already installed on your machine, but if not, you can get it from http://smartmontools.sourceforge.net/.
The first step is to see if your hard drives are S.M.A.R.T. enabled. You can do this using the smartctl application that comes with SmartMonTools. Here is the output I get when I run 'smartctl -d ata -i /dev/sda':
# smartctl -d ata -i /dev/sda
smartctl version 5.36 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: ST3500630AS
Serial Number: 3QG02JST
Firmware Version: 3.AAC
User Capacity: 500,107,862,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Thu Jul 10 05:18:47 2008 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Those last two lines are what we are looking for. This drive is SMART enabled, so we are good to go. A couple of comments about the command I issued. If you want more information about your hard drive, try the -a flag instead of -i, which prints all the SMART information smartctl can read from the drive. The '-d ata' flag was required for me to tell smartctl that I am checking an ATA drive. You may not require the -d flag.
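If you have more than a couple of drives, you may not want to read those two lines by eye each time. Here is a minimal sketch of a shell helper that does the check for you. The function name smart_status is my own invention and not part of SmartMonTools, and it assumes the "SMART support is:" line format shown in the output above.

```shell
#!/bin/sh
# smart_status: hypothetical helper (not part of SmartMonTools).
# Reads `smartctl -i` output on stdin and prints one word:
#   enabled     - a "SMART support is: Enabled" line was seen
#   disabled    - SMART is available but no "Enabled" line was seen
#   unsupported - no recognised "SMART support is:" line at all
# Assumes the line format shown in the smartctl output above.
smart_status() {
    awk '
        /^SMART support is:[ \t]*Available/ { available = 1 }
        /^SMART support is:[ \t]*Enabled/   { enabled = 1 }
        END {
            if (enabled)        print "enabled"
            else if (available) print "disabled"
            else                print "unsupported"
        }
    '
}

# Example (needs root and a real drive):
#   smartctl -d ata -i /dev/sda | smart_status
```

If a drive reports "disabled", running 'smartctl -s on /dev/sda' should switch SMART on for that drive.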
The next step is to modify the /etc/smartd.conf file. Using your favourite editor, open up /etc/smartd.conf. The first thing to do is remove the first line of the file. That line marks the file as auto-generated; removing it tells smartd that you have edited the file and that it should not be overwritten on the next startup. If you don't have a smartd.conf file, then you can auto-generate the first version simply by starting and stopping smartd with /etc/init.d/smartd start and then /etc/init.d/smartd stop.
Modify the conf file so that our drives will be monitored regularly. Here is my conf file:
# Remove the line above if you have edited the file and you do not want
# it to be overwritten on the next smartd startup.
<SNIP>
/dev/sda -d ata -H -m me@mydomain.ca -M test
/dev/sdb -d ata -H -m me@mydomain.ca -M test
<SNIP>
First off, you will see that I defined the '-d ata' device flag. The -H flag is telling smartd to monitor the Health of the drive. -m is telling smartd to mail someone, in this case me, of any issues. The '-M test' flag can only be used in conjunction with the -m flag and in this case is telling smartd to send a test email to me on start up. I have added the -M flag as I want to be sure that smartd is really working and can email me.
At the bottom of this post is a partial list of flags that you can use with smartd.
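With more than two drives, typing these entries by hand gets tedious and invites typos. As a rough sketch, the helper below prints one entry per drive using exactly the directives from my conf file. The function name smartd_conf_lines is hypothetical, and the email address and device names are only the examples used in this post.

```shell
#!/bin/sh
# smartd_conf_lines: hypothetical helper, not part of SmartMonTools.
# Usage: smartd_conf_lines EMAIL DEVICE...
# Prints one smartd.conf entry per device, using the same directives as
# the conf file in this post: -d ata, -H, -m EMAIL and -M test.
smartd_conf_lines() {
    email=$1
    shift
    for dev in "$@"; do
        printf '%s -d ata -H -m %s -M test\n' "$dev" "$email"
    done
}

# Example:
#   smartd_conf_lines me@mydomain.ca /dev/sda /dev/sdb
```

Check the printed lines before pasting them into /etc/smartd.conf, and drop '-d ata' if your drives do not need it.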
If we try to start smartd right now, we will most likely be disappointed, as nothing will happen. We first need to force smartd to see our drives by registering them with smartd. We can do this by running a quick CLI command for each drive:
echo /dev/sda -d ata -m me@mydomain.ca -M test | smartd -c - -q onecheck
We are piping a line of directives to smartd. The directives should look familiar by now, so I won't go over them again. The flags for smartd itself are a little different, so let's go over those. The -c flag tells smartd to use a specific configuration file. The single dash that follows it tells smartd not to read a configuration file from disk, but to accept the directives piped in on standard input instead. The -q flag tells smartd when it should quit. In this case, we are telling smartd to register our drive, run one check on the drive, and then quit. This command line serves two purposes: it registers the device, and it verifies that an email can be sent out.
Here is what I get when I run this command:
echo /dev/sda -d ata -m me@mydomain.ca -M test | smartd -c - -q onecheck
smartd version 5.36 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
Opened configuration file <stdin>
Drive: /dev/sda, implied '-a' Directive on line 1 of file <stdin>
Configuration file <stdin> parsed.
Device: /dev/sda, opened
Device: /dev/sda, not found in smartd database.
Device: /dev/sda, is SMART capable. Adding to "monitor" list.
Monitoring 1 ATA and 0 SCSI devices
Executing test of mail to me@mydomain.ca ...
Test of mail to me@mydomain.ca: successful
Started with '-q onecheck' option. All devices sucessfully checked once.
smartd is exiting (exit status 0)
The important line here is:
Device: /dev/sda, is SMART capable. Adding to "monitor" list.
We have now registered /dev/sda with smartd, and smartd will now monitor this device. In my inbox I got this email:
This email was generated by the smartd daemon running on:
host name: server.mydomain.ca
DNS domain: mydomain.ca
NIS domain: (none)
The following warning/error was logged by the smartd daemon:
TEST EMAIL from smartd for device: /dev/sda
For details see host's SYSLOG (default: /var/log/messages).
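When running the check for several drives, it helps to pull out just the "Adding to "monitor" list" line instead of reading the whole output each time. The following is only a sketch: monitored_devices is a name I made up, and it assumes the message wording shown in the smartd output above.

```shell
#!/bin/sh
# monitored_devices: hypothetical helper, not part of SmartMonTools.
# Reads smartd output on stdin and prints the device name from every
# line of the form shown above:
#   Device: /dev/sda, is SMART capable. Adding to "monitor" list.
monitored_devices() {
    sed -n 's/^Device: \([^,]*\), is SMART capable\. Adding to "monitor" list\..*/\1/p'
}

# Example (needs root and a working smartd; device names and the email
# address are the ones used in this post):
#   for dev in /dev/sda /dev/sdb; do
#       echo "$dev -d ata -m me@mydomain.ca -M test" \
#           | smartd -c - -q onecheck | monitored_devices
#   done
```

Each drive that smartd accepted is printed on its own line; a drive that is missing from the list deserves a closer look at the full output.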
Once you have successfully run the command for all your devices, you can fire up smartd with '/etc/init.d/smartd start'. If all went well, you should have an email like the one above in your inbox for each device you set up in the config file. This tells you that the daemon is running and can send an email when an issue occurs. The last step is to remove the '-M test' flag from each device line in your /etc/smartd.conf file, then restart smartd with '/etc/init.d/smartd restart'.
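If you would rather not edit each line by hand, the '-M test' flag can be stripped with sed. This is a minimal sketch, assuming the flag is written exactly as '-M test' as in my conf file; strip_test_flag is a hypothetical name, and it only filters text, so the original file is never touched.

```shell
#!/bin/sh
# strip_test_flag: hypothetical helper, not part of SmartMonTools.
# Reads smartd.conf text on stdin and writes it to stdout with the
# '-M test' directive removed from device lines. Comment lines
# (those starting with '#') are passed through unchanged.
strip_test_flag() {
    sed '/^[[:space:]]*#/!s/[[:space:]]*-M[[:space:]][[:space:]]*test//'
}

# Example: write to a scratch copy and review it before replacing
# the real file.
#   strip_test_flag < /etc/smartd.conf > /tmp/smartd.conf.new
```

Compare the scratch copy against the original before copying it into place, then restart smartd as described above.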
Be sure that you have added smartd to your init levels 3, 4 and 5 with this command:
chkconfig --level 345 smartd on
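To confirm the change took effect, 'chkconfig --list smartd' prints the on/off state for each runlevel. The small sketch below boils that line down to just the runlevels that are on. The function name runlevels_on is hypothetical, and it assumes the usual '0:off 1:off ... 6:off' layout that chkconfig prints on Red Hat style systems.

```shell
#!/bin/sh
# runlevels_on: hypothetical helper, not part of chkconfig.
# Reads `chkconfig --list SERVICE` output on stdin and prints, for each
# line, the digits of the runlevels marked "on" (for example 345).
# Assumes fields of the form N:on / N:off after the service name.
runlevels_on() {
    awk '{
        out = ""
        for (i = 2; i <= NF; i++)
            if ($i ~ /^[0-6]:on$/) out = out substr($i, 1, 1)
        print out
    }'
}

# Example:
#   chkconfig --list smartd | runlevels_on
```

After the chkconfig command above, this should print 345.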
That's it for today. Hopefully it will help you sleep better at night.
# HERE IS A LIST OF DIRECTIVES FOR THIS CONFIGURATION FILE
# -d TYPE Set the device type to one of: ata, scsi
# -T TYPE set the tolerance to one of: normal, permissive
# -o VAL Enable/disable automatic offline tests (on/off)
# -S VAL Enable/disable attribute autosave (on/off)
# -H Monitor SMART Health Status, report if failed
# -l TYPE Monitor SMART log. Type is one of: error, selftest
# -f Monitor for failure of any 'Usage' Attributes
# -m ADD Send warning email to ADD for -H, -l error, -l selftest, and -f
# -M TYPE Modify email warning behavior (see man page)
# -p Report changes in 'Prefailure' Normalized Attributes
# -u Report changes in 'Usage' Normalized Attributes
# -t Equivalent to -p and -u Directives
# -r ID Also report Raw values of Attribute ID with -p, -u or -t
# -R ID Track changes in Attribute ID Raw value with -p, -u or -t
# -i ID Ignore Attribute ID for -f Directive
# -I ID Ignore Attribute ID for -p, -u or -t Directive
# -v N,ST Modifies labeling of Attribute N (see man page)
# -a Default: equivalent to -H -f -t -l error -l selftest
# -F TYPE Use firmware bug workaround. Type is one of: none, samsung
# -P TYPE Drive-specific presets: use, ignore, show, showall
# # Comment: text after a hash sign is ignored
# \ Line continuation character
# Attribute ID is a decimal integer 1 <= ID <= 255
# All but -d, -m and -M Directives are only implemented for ATA devices
http://www.outofcontrol.ca/2008/07/10/use-smartd-smartmontools-to-prevent-a-disaster/