
Heartbeat and DRBD

SkyHi @ Friday, January 22, 2010
Heartbeat and DRBD are high-availability solutions. Heartbeat (part of the Linux-HA project) manages a cluster of servers and makes sure all services keep running, while DRBD (Distributed Replicated Block Device) is its storage counterpart, keeping the data replicated between the nodes and always available. Together they help you keep the damage from a hardware or software failure as small as possible.
The drawback of DRBD, in the single-primary configuration used here, is that you can only read from and write to the primary (master) node; the mirrored partition cannot be mounted on the secondary (slave) node. To switch over, you have to unmount the partition on the primary node (node1), tell DRBD on the secondary node (node2) to become primary, and then mount the partition there to read and write the data.
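Before Heartbeat is set up to automate this, a manual switch-over looks roughly like the following sketch (assuming a resource named drbd0 that is mounted on /wwwjail/siteroot, as configured later in this article):
testcluster1 / # umount /wwwjail/siteroot
testcluster1 / # drbdadm secondary drbd0
testcluster2 / # drbdadm primary drbd0
testcluster2 / # mount /dev/drbd0 /wwwjail/siteroot
Heartbeat's drbddisk and Filesystem resource scripts, described further down, run essentially this sequence for you during a failover.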

Assumptions and starting configuration

It is assumed you have two identical Gentoo installations. If you are only here for the information and are not setting up two or more physical boxes, you can run them in virtual machines such as VMware. Both installations have a public static IP address on the first NIC, and the secondary NIC should carry a private IP address for the cluster link. You will also need one additional public static IP address to use as the "service" IP address; everything that relies on the cluster as a whole should use this address.

System Configuration

Start by configuring the network interfaces; your current configuration may already be suitable.
File: testcluster1: /etc/conf.d/net
# External static interface.
config_eth0=( "192.168.0.101 netmask 255.255.255.0 brd 192.168.0.255" )
routes_eth0=( "default gw 192.168.0.1" )
dns_servers_eth0="4.2.2.1 4.2.2.2"
dns_domain_eth0="yourdomain.tld"

# This is the heartbeat and disk syncing interface.
config_eth1=( "10.0.0.1 netmask 255.255.255.0 brd 10.0.0.255" )
File: testcluster2: /etc/conf.d/net
# External static interface.
config_eth0=( "192.168.0.102 netmask 255.255.255.0 brd 192.168.0.255" )
routes_eth0=( "default gw 192.168.0.1" )
dns_servers_eth0="4.2.2.1 4.2.2.2"
dns_domain_eth0="yourdomain.tld"

# This is the heartbeat and disk syncing interface.
config_eth1=( "10.0.0.2 netmask 255.255.255.0 brd 10.0.0.255" )
File: both machines: /etc/hosts
# IPv4 and IPv6 localhost aliases
127.0.0.1            localhost.localdomain localhost
192.168.0.100         testcluster.yourdomain.tld testcluster
192.168.0.101         testcluster1.yourdomain.tld testcluster1
192.168.0.102         testcluster2.yourdomain.tld testcluster2

Installing and configuring DRBD

Preparing your HD for DRBD

If you want to use DRBD for mirroring, you should create an extra partition to hold the data you want to mirror to the other node (e.g. /var/lib/postgresql for PostgreSQL or /var/www for Apache). In addition to the mirrored data, DRBD needs at least 128MB to store its meta-data. For example, here's how to lay out an additional virtual disk with two partitions, one for Apache and the other for MySQL.
Code: Partition table for DRBD
testcluster1 / # fdisk /dev/sdb

Command (m for help): p

Disk /dev/sdb: 2147 MB, 2147483648 bytes
255 heads, 63 sectors/track, 261 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1         131     1052226   83  Linux
/dev/sdb2             132         261     1044225   83  Linux
To specify an exact partition size, you should change the display units to sectors by issuing the "u" command in fdisk, then create the partitions as explained in the Gentoo handbook.
Note: The size can be specified using fdisk with +128M when asked for the ending sector.
Note: You can make an exact copy of a partition table using sfdisk. Example: dumping the partition table with sfdisk
sfdisk -d /dev/sda
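As a sketch, the dump can also be replayed directly on the second node over ssh (assuming the DRBD disk is /dev/sdb on both machines and root ssh access between the nodes is available):
testcluster1 / # sfdisk -d /dev/sdb | ssh testcluster2 "sfdisk /dev/sdb"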

Kernel Configuration

Activate the following options:
Linux Kernel Configuration: Support for bindings
Device Drivers --->
 -- Connector - unified userspace <-> kernelspace linker 

Cryptographic API --->
 -- Cryptographic algorithm manager

Installing, configuring and running DRBD

Please note that you need to do the following on each cluster node. Install DRBD:
testcluster1 / # emerge -av drbd
testcluster2 / # emerge -av drbd
Code: emerge -av drbd
These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild  N    ] sys-cluster/drbd-kernel-8.0.13  327 kB
[ebuild  N    ] sys-cluster/drbd-8.0.13  0 kB

Total: 2 packages (2 new), Size of downloads: 327 kB

After you've successfully installed DRBD, you'll need to create the configuration file. The following is the complete configuration.
Fix me: There is a lot of redundancy that might be better placed in the common section. Can someone take a look and optimize the config? (A possible consolidation is sketched after the configuration below.)
It should be noted that "testcluster1" and "testcluster2" must match the hostnames of your boxes as reported by uname -n.
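A quick way to confirm the names before writing the config:
testcluster1 / # uname -n
testcluster1
testcluster2 / # uname -n
testcluster2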
File: /etc/drbd.conf
global {
        usage-count no;
}

common {
}


#
# Resource names need not be drbd#; you may use descriptive names,
# like "resource web" or "resource mail", too.
#

resource "drbd0" {
        # transfer protocol to use.
        # C: write IO is reported as completed, if we know it has
        #    reached _both_ local and remote DISK.
        #    * for critical transactional data.
        # B: write IO is reported as completed, if it has reached
        #    local DISK and remote buffer cache.
        #    * for most cases.
        # A: write IO is reported as completed, if it has reached
        #    local DISK and local tcp send buffer. (see also sndbuf-size)
        #    * for high latency networks
        #
        protocol C;

        handlers {
                # what should be done in case the cluster starts up in
                # degraded mode, but knows it has inconsistent data.
                #pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

                pri-on-incon-degr "echo 'DRBD: primary requested but inconsistent!' | wall; /etc/init.d/heartbeat stop"; #"halt -f";
                pri-lost-after-sb "echo 'DRBD: primary requested but lost!' | wall; /etc/init.d/heartbeat stop"; #"halt -f";

                #pri-on-incon-degr "echo o > /proc/sysrq-trigger";
                #pri-lost-after-sb "echo o > /proc/sysrq-trigger";
                #local-io-error "echo o > /proc/sysrq-trigger";
        }

        startup {
                #The init script drbd(8) blocks the boot process until the DRBD resources are connected.  When the  cluster  manager
                #starts later, it does not see a resource with internal split-brain.  In case you want to limit the wait time, do it
                #here.  Default is 0, which means unlimited. The unit is seconds.
                wfc-timeout 0;  # 0 = wait forever (see note above)

                # Wait for connection timeout if this node was a degraded cluster.
                # In case a degraded cluster (= cluster with only one node left)
                # is rebooted, this timeout value is used.
                #
                degr-wfc-timeout 120;    # 2 minutes.
        }

        syncer {
                rate 100M;
                # This is now expressed with "after res-name"
                #group 1;
                al-extents 257;
        }

        net {
                # TODO: Should these timeouts be relative to some heartbeat settings?
                # timeout       60;    #  6 seconds  (unit = 0.1 seconds)
                # connect-int   10;    # 10 seconds  (unit = 1 second)
                # ping-int      10;    # 10 seconds  (unit = 1 second)

                # if the connection to the peer is lost you have the choice of
                #  "reconnect"   -> Try to reconnect (AKA WFConnection state)
                #  "stand_alone" -> Do not reconnect (AKA StandAlone state)
                #  "freeze_io"   -> Try to reconnect but freeze all IO until
                #                   the connection is established again.
                # FIXME This appears to be obsolete
                #on-disconnect reconnect;

                # FIXME Experimental options
                #cram-hmac-alg "sha256";
                #shared-secret "secretPassword555";
                #after-sb-0pri discard-younger-primary;
                #after-sb-1pri consensus;
                #after-sb-2pri disconnect;
                #rr-conflict disconnect;
        }

        disk {
                # if the lower level device reports io-error you have the choice of
                #  "pass_on"  ->  Report the io-error to the upper layers.
                #                 Primary   -> report it to the mounted file system.
                #                 Secondary -> ignore it.
                #  "panic"    ->  The node leaves the cluster by doing a kernel panic.
                #  "detach"   ->  The node drops its backing storage device, and
                #                 continues in diskless mode.
                #
                on-io-error   pass_on;

                # Fencing refers to preventive measures taken to avoid situations where both nodes are
                # primary and disconnected (AKA split brain).
                fencing dont-care;

                # In case you only want to use a fraction of the available space
                # you might use the "size" option here.
                #
                # size 10G;
        }


        on testcluster1 {
                device          /dev/drbd0;
                disk            /dev/sdb1;
                address         10.0.0.1:7788;
                meta-disk       internal;
        }

        on testcluster2 {
                device          /dev/drbd0;
                disk            /dev/sdb1;
                address         10.0.0.2:7788;
                meta-disk       internal;
        }
}

resource "drbd1" {
        # transfer protocol to use.
        # C: write IO is reported as completed, if we know it has
        #    reached _both_ local and remote DISK.
        #    * for critical transactional data.
        # B: write IO is reported as completed, if it has reached
        #    local DISK and remote buffer cache.
        #    * for most cases.
        # A: write IO is reported as completed, if it has reached
        #    local DISK and local tcp send buffer. (see also sndbuf-size)
        #    * for high latency networks
        #
        protocol C;

        handlers {
                # what should be done in case the cluster starts up in
                # degraded mode, but knows it has inconsistent data.
                #pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

                pri-on-incon-degr "echo 'DRBD: primary requested but inconsistent!' | wall; /etc/init.d/heartbeat stop"; #"halt -f";
                pri-lost-after-sb "echo 'DRBD: primary requested but lost!' | wall; /etc/init.d/heartbeat stop"; #"halt -f";

                #pri-on-incon-degr "echo o > /proc/sysrq-trigger";
                #pri-lost-after-sb "echo o > /proc/sysrq-trigger";
                #local-io-error "echo o > /proc/sysrq-trigger";
        }

        startup {
                #The init script drbd(8) blocks the boot process until the DRBD resources are connected.  When the  cluster  manager
                #starts later, it does not see a resource with internal split-brain.  In case you want to limit the wait time, do it
                #here.  Default is 0, which means unlimited. The unit is seconds.
                wfc-timeout 0;  # 0 = wait forever (see note above)

                # Wait for connection timeout if this node was a degraded cluster.
                # In case a degraded cluster (= cluster with only one node left)
                # is rebooted, this timeout value is used.
                #
                degr-wfc-timeout 120;    # 2 minutes.
        }

        syncer {
                rate 100M;
                # This is now expressed with "after res-name"
                #group 1;
                al-extents 257;
        }

        net {
                # TODO: Should these timeouts be relative to some heartbeat settings?
                # timeout       60;    #  6 seconds  (unit = 0.1 seconds)
                # connect-int   10;    # 10 seconds  (unit = 1 second)
                # ping-int      10;    # 10 seconds  (unit = 1 second)

                # if the connection to the peer is lost you have the choice of
                #  "reconnect"   -> Try to reconnect (AKA WFConnection state)
                #  "stand_alone" -> Do not reconnect (AKA StandAlone state)
                #  "freeze_io"   -> Try to reconnect but freeze all IO until
                #                   the connection is established again.
                # FIXME This appears to be obsolete
                #on-disconnect reconnect;

                # FIXME Experimental options
                #cram-hmac-alg "sha256";
                #shared-secret "secretPassword555";
                #after-sb-0pri discard-younger-primary;
                #after-sb-1pri consensus;
                #after-sb-2pri disconnect;
                #rr-conflict disconnect;
        }

        disk {
                # if the lower level device reports io-error you have the choice of
                #  "pass_on"  ->  Report the io-error to the upper layers.
                #                 Primary   -> report it to the mounted file system.
                #                 Secondary -> ignore it.
                #  "panic"    ->  The node leaves the cluster by doing a kernel panic.
                #  "detach"   ->  The node drops its backing storage device, and
                #                 continues in diskless mode.
                #
                on-io-error   pass_on;

                # Fencing refers to preventive measures taken to avoid situations where both nodes are
                # primary and disconnected (AKA split brain).
                fencing dont-care;

                # In case you only want to use a fraction of the available space
                # you might use the "size" option here.
                #
                # size 10G;
        }


        on testcluster1 {
                device          /dev/drbd1;
                disk            /dev/sdb2;
                address         10.0.0.1:7789;
                meta-disk       internal;
        }

        on testcluster2 {
                device          /dev/drbd1;
                disk            /dev/sdb2;
                address         10.0.0.2:7789;
                meta-disk       internal;
        }
}
Don't forget to copy this file to both nodes.
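Regarding the Fix me above: the protocol, handlers, startup, syncer, net and disk settings are identical for both resources, so they could most likely be moved into the common section, leaving each resource with only its devices and addresses. An untested sketch, keeping the values used above:
common {
        protocol C;
        handlers {
                pri-on-incon-degr "echo 'DRBD: primary requested but inconsistent!' | wall; /etc/init.d/heartbeat stop";
                pri-lost-after-sb "echo 'DRBD: primary requested but lost!' | wall; /etc/init.d/heartbeat stop";
        }
        startup {
                wfc-timeout 0;
                degr-wfc-timeout 120;
        }
        syncer {
                rate 100M;
                al-extents 257;
        }
        disk {
                on-io-error pass_on;
                fencing dont-care;
        }
}

resource "drbd0" {
        on testcluster1 { device /dev/drbd0; disk /dev/sdb1; address 10.0.0.1:7788; meta-disk internal; }
        on testcluster2 { device /dev/drbd0; disk /dev/sdb1; address 10.0.0.2:7788; meta-disk internal; }
}

resource "drbd1" {
        on testcluster1 { device /dev/drbd1; disk /dev/sdb2; address 10.0.0.1:7789; meta-disk internal; }
        on testcluster2 { device /dev/drbd1; disk /dev/sdb2; address 10.0.0.2:7789; meta-disk internal; }
}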
Now it's time to set up DRBD. Run the following commands on both nodes.
testcluster1 / # modprobe drbd
testcluster1 / # drbdadm create-md drbd0
testcluster1 / # drbdadm attach drbd0
testcluster1 / # drbdadm connect drbd0
testcluster1 / # drbdadm create-md drbd1
testcluster1 / # drbdadm attach drbd1
testcluster1 / # drbdadm connect drbd1

testcluster2 / # modprobe drbd
testcluster2 / # drbdadm create-md drbd0
testcluster2 / # drbdadm attach drbd0
testcluster2 / # drbdadm connect drbd0
testcluster2 / # drbdadm create-md drbd1
testcluster2 / # drbdadm attach drbd1
testcluster2 / # drbdadm connect drbd1

Now on the primary node run the following to synchronize both drbd disks:
testcluster1 / # drbdadm -- --overwrite-data-of-peer primary drbd0
testcluster1 / # drbdadm -- --overwrite-data-of-peer primary drbd1
At this point a full synchronization should be occurring. You can monitor the progress with the following command.
testcluster1 / # watch cat /proc/drbd
Code: monitor the synchronization progress
version: 8.0.11 (api:86/proto:86)
GIT-hash: b3fe2bdfd3b9f7c2f923186883eb9e2a0d3a5b1b build by root@localhost, 2008-04-18 11:35:09
 0: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
    ns:2176 nr:0 dw:49792 dr:2196 al:17 bm:0 lo:0 pe:5 ua:0 ap:0
       [>....................] sync'ed:  0.4% (1050136/1052152)K
       finish: 0:43:45 speed: 336 (336) K/sec
       resync: used:0/31 hits:130 misses:1 starving:0 dirty:0 changed:1
       act_log: used:0/257 hits:12431 misses:35 starving:0 dirty:18 changed:17
 1: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
    ns:163304 nr:0 dw:33252 dr:130072 al:13 bm:8 lo:0 pe:5 ua:0 ap:0
       [=>..................] sync'ed: 14.6% (893640/1044156)K
       finish: 0:49:38 speed: 296 (360) K/sec
       resync: used:0/31 hits:13317 misses:16 starving:0 dirty:0 changed:16
       act_log: used:0/257 hits:8300 misses:13 starving:0 dirty:0 changed:13
Depending on your hardware and the size of the partition, this could take some time. Later, when everything is synced, the mirroring will be very fast. See DRBD-Performance for more information.
You can use /dev/drbd0 and /dev/drbd1 as normal block devices even before the sync has finished, so let's go ahead and format them. Use whatever filesystem you want; in this example we use ext3. Do this on the first node only.
Formatting the disks:
testcluster1 / # mke2fs -j /dev/drbd0
testcluster1 / # mke2fs -j /dev/drbd1
Now set the primary and secondary roles. Notice these commands are different on each node.
testcluster1 / # drbdadm primary all
testcluster2 / # drbdadm secondary all
Make sure you add the mount points to the fstab and that they are set to noauto. Again, this needs to be done on both nodes.
File: /etc/fstab
# <fs>                  <mountpoint>    <type>          <opts>          <dump/pass>

# NOTE: If your BOOT partition is ReiserFS, add the notail option to opts.
/dev/sda1               /boot           ext2            noatime         1 2
/dev/sda3               /               ext3            noatime         0 1
/dev/sda2               none            swap            sw              0 0
/dev/cdrom              /mnt/cdrom      auto            noauto,ro       0 0
#/dev/fd0               /mnt/floppy     auto            noauto          0 0
/dev/drbd0              /wwwjail/siteroot       ext3    noauto          0 0
/dev/drbd1              /wwwjail/mysql          ext3    noauto          0 0

proc                    /proc           proc            defaults        0 0

# glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
# POSIX shared memory (shm_open, shm_unlink).
# (tmpfs is a dynamically expandable/shrinkable ramdisk, and will
#  use almost no memory if not populated with files)
shm                     /dev/shm        tmpfs           nodev,nosuid,noexec     0 0
Time to create the mount points, again on both nodes.
testcluster1 / # mkdir -p /wwwjail/siteroot
testcluster1 / # mkdir -p /wwwjail/mysql
testcluster2 / # mkdir -p /wwwjail/siteroot
testcluster2 / # mkdir -p /wwwjail/mysql
You can mount them on the first node:
testcluster1 / # mount /wwwjail/siteroot
testcluster1 / # mount /wwwjail/mysql
Note: It is possible to mount the DRBD partition on both nodes, but this is not covered in this article. For more information see [1].
MySQL should already be installed, but we need to configure it to use the DRBD device. We do that by simply putting all the databases and logs in /wwwjail/mysql. In a production environment you'd probably split logs, data and index files onto different devices; since this is an experimental system, we'll just put everything into one resource.
Make sure no bind-address is set, because we need to bind to all interfaces and then limit access with iptables if need be (a sketch follows the config snippet below). This needs to go on both nodes.
File: /etc/mysql/my.cnf
...
#bind-address                           = 127.0.0.1
...
#datadir                                        = /var/lib/mysql
datadir                                         = /wwwjail/mysql
...
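If you do need to restrict access, a rule set along these lines should work (the subnets are the ones assumed in this article; adjust to your network), run on both nodes:
# Allow MySQL from the LAN and the cluster link, drop everything else.
iptables -A INPUT -p tcp --dport 3306 -s 192.168.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 3306 -s 10.0.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 3306 -j DROP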
Now we need to install the initial MySQL databases onto the shared device. Because /wwwjail/mysql lives on the DRBD device, this only needs to be done on the node where it is currently mounted (the primary); DRBD replicates the result to the other node.
testcluster1 / # mysql_install_db
OK, if everything has gone well to this point, you can add DRBD to the startup items on both nodes by adding it to the default runlevel.
testcluster1 / # rc-update add drbd default
testcluster2 / # rc-update add drbd default
Warning: Do not add drbd to your kernel's auto loaded modules! It *will* cause issues with heartbeat.
Note: It is unclear if you need to wait for the sync to finish at this point, but it might be a good idea anyway.
After syncing (unless you're brave) you should be able to start DRBD normally:
testcluster1 / # /etc/init.d/drbd start
testcluster2 / # /etc/init.d/drbd start
The DRBD init script should load the drbd kernel module automatically:
testcluster1 etc # lsmod
Code: kernel modules listing
Module                  Size  Used by
drbd                  142176  1 
You can now swap the roles again (using the same sequence sketched at the start of this article) to verify that syncing works in both directions. If you are fast enough you can issue these commands within a few seconds, but you'll only find out that DRBD was faster ;-)
If you reboot a node during testing, you will have to issue
# drbdadm primary all
on the node you want the data to be served from, because DRBD does not remember the roles (to DRBD, both nodes are equal). We will automate this with Heartbeat later.
If the nodes end up with differing data, you may have a split brain. To fix this, run the following commands. Note that this assumes testcluster1 is more up to date than testcluster2; if the opposite is true, swap the commands between the nodes.

testcluster1 / # drbdadm connect all
testcluster2 / # drbdadm -- --discard-my-data connect all

Installing and Configuring Heartbeat

Heartbeat is based on init scripts, so setting it up to do advanced things is not that difficult, but that is not covered in this document.
Again, most of the steps here need to be run on both nodes. Go ahead and emerge heartbeat.
emerge -av heartbeat
Code: emerging heartbeat
These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild  N    ] sys-cluster/heartbeat-2.0.7-r2  USE="-doc -ldirectord -management -snmp" 3,250 kB

Total: 1 package (1 new), Size of downloads: 3,250 kB

All of the Heartbeat configuration is done in /etc/ha.d/. Most of the important config files are not installed by default, so you will need to create them.
File: /etc/ha.d/ha.cf
# What interfaces to heartbeat over?
#udp  eth1
bcast eth1

# keepalive: how many seconds between heartbeats
keepalive 2

# Time in seconds before issuing a "late heartbeat" warning in the logs.
warntime 10

# Node is pronounced dead after 15 seconds.
deadtime 15

# With some configurations, the network takes some time to start working after a reboot.
# This is a separate "deadtime" to handle that case. It should be at least twice the normal deadtime.
initdead 30

# Mandatory. Hostname of machine in cluster as described by uname -n.
node    testcluster1
node    testcluster2


# When auto_failback is set to on once the master comes back online, it will take
# everything back from the slave.
auto_failback off

# Some default uid, gid info, This is required for ipfail
apiauth default uid=nobody gid=cluster
apiauth ipfail uid=cluster
apiauth ping gid=nobody uid=nobody,cluster

# This is to fail over if the outbound network connection goes down.
respawn cluster /usr/lib/heartbeat/ipfail

# IP to ping to check to see if the external connection is up.
ping 192.168.0.1
deadping 15

debugfile /var/log/ha-debug

# File to write other messages to
logfile /var/log/ha-log

# Facility to use for syslog()/logger
logfacility     local0
The haresources file is probably the most important file to configure. It lists which resource scripts need to be run and the parameters to pass to them; scripts are searched for in /etc/ha.d/resource.d/ first, then in /etc/init.d.
Please note that the init scripts need to follow the Linux Standard Base Core Specification, specifically with respect to the function return codes (a minimal skeleton is sketched below).
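As a rough sketch, a minimal resource script for a hypothetical "mydaemon" service could look like this; the important part for Heartbeat is that "status" exits with 0 when the service is running and 3 when it is stopped:
#!/bin/sh
# Minimal LSB-style resource script skeleton for a hypothetical "mydaemon".
case "$1" in
  start)
    /usr/sbin/mydaemon &
    exit $?
    ;;
  stop)
    killall mydaemon
    exit 0
    ;;
  status)
    if pidof mydaemon >/dev/null; then
      echo "mydaemon is running"
      exit 0   # LSB: 0 = service is running
    else
      echo "mydaemon is stopped"
      exit 3   # LSB: 3 = service is not running
    fi
    ;;
  *)
    echo "Usage: $0 {start|stop|status}"
    exit 1
    ;;
esac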
Fix me: Can anyone look up whether this is the correct format for this file?
File: both nodes: /etc/ha.d/haresources
testcluster1 IPaddr::192.168.0.100
testcluster1 drbddisk::drbd0 Filesystem::/dev/drbd0::/wwwjail/siteroot::ext3::noatime apache2
testcluster1 drbddisk::drbd1 Filesystem::/dev/drbd1::/wwwjail/mysql::ext3::noatime mysql
For example, IPaddr::192.168.0.100 will run the /etc/ha.d/resource.d/IPaddr script, which creates an IP alias on eth0 with the IP address 192.168.0.100.
Warning: The contents of the haresources file must be exactly the same on both nodes!
drbddisk will run drbdadm primary drbd0, and Filesystem is basically just a mount.
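If the service address needs a specific netmask or interface, the IPaddr resource accepts them inline as well; for example (assuming eth0 and a /24 network):
testcluster1 IPaddr::192.168.0.100/24/eth0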
The last file tells Heartbeat how to authenticate the messages exchanged between the nodes. Because this example is based on a (simulated) crossover cable connecting the two nodes, we will just use a CRC check.
File: both nodes: /etc/ha.d/authkeys
auth 1
1 crc
If you plan on sending the heartbeat across a shared network, you should use something a little stronger than CRC. The following is the configuration for sha1.
File: both nodes: /etc/ha.d/authkeys
auth 1
1 sha1 MySharedSecretPassword
Finally, because the /etc/ha.d/authkeys file may contain a plain-text password, set restrictive permissions on both nodes.
testcluster1 / # chown root:root /etc/ha.d/authkeys
testcluster1 / # chmod 600 /etc/ha.d/authkeys
testcluster2 / # chown root:root /etc/ha.d/authkeys
testcluster2 / # chmod 600 /etc/ha.d/authkeys
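With all of the configuration in place, Heartbeat can be added to the default runlevel and started on both nodes, using the same init script referenced in drbd.conf above:
testcluster1 / # rc-update add heartbeat default
testcluster1 / # /etc/init.d/heartbeat start
testcluster2 / # rc-update add heartbeat default
testcluster2 / # /etc/init.d/heartbeat start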

References