Heartbeat and DRBD are high-availability solutions. Heartbeat (part of the Linux-HA project) manages a cluster of servers and makes sure services keep running, while DRBD (Distributed Replicated Block Device) is its storage counterpart, mirroring data between nodes so it stays available. Together they help you keep the damage from a hardware or software failure as small as possible.
The current drawback of DRBD is that you can only read from and write to the primary (master) node; the mirrored partition cannot be mounted on the secondary (slave) node. To switch over, you have to unmount it on the primary node (node1), tell the DRBD client on the secondary node (node2) to become primary, and then mount the partition there to read and write the data.
Assumptions and starting configuration
It is assumed you have two identical Gentoo installations. If you are only here for the information and are not setting up two or more physical boxes, you can run them in a VM such as VMware. Both installations have a public static IP address, and the secondary NIC on each should have some kind of private IP address. You will also need an additional public static IP address that will be used as the "service" IP address; everything that relies on the cluster as a whole should use this address for services.
System Configuration
Start by tweaking the network devices; it may be that your current configuration already works.
File: testcluster1: /etc/conf.d/net
# External static interface.
config_eth0=( "192.168.0.101 netmask 255.255.255.0 brd 192.168.0.255" )
routes_eth0=( "default gw 192.168.0.1" )
dns_servers_eth0="4.2.2.1 4.2.2.2"
dns_domain_eth0="yourdomain.tld"

# This is the heartbeat and disk syncing interface.
config_eth1=( "10.0.0.1 netmask 255.255.255.0 brd 10.0.0.255" )
File: testcluster2: /etc/conf.d/net
# External static interface.
config_eth0=( "192.168.0.102 netmask 255.255.255.0 brd 192.168.0.255" )
routes_eth0=( "default gw 192.168.0.1" )
dns_servers_eth0="4.2.2.1 4.2.2.2"
dns_domain_eth0="yourdomain.tld"

# This is the heartbeat and disk syncing interface.
config_eth1=( "10.0.0.2 netmask 255.255.255.0 brd 10.0.0.255" )
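If you changed the interface configuration, the new settings only take effect once the network scripts are restarted. A minimal sketch for Gentoo's baselayout init scripts (the script name for eth1 and whether the symlink already exists depend on your setup; run the equivalent commands on testcluster2 as well):
testcluster1 / # ln -s /etc/init.d/net.lo /etc/init.d/net.eth1
testcluster1 / # /etc/init.d/net.eth1 restart
testcluster1 / # rc-update add net.eth1 default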
File: both machines: /etc/hosts
# IPv4 and IPv6 localhost aliases
127.0.0.1        localhost.localdomain localhost

192.168.0.100    testcluster.yourdomain.tld   testcluster
192.168.0.101    testcluster1.yourdomain.tld  testcluster1
192.168.0.102    testcluster2.yourdomain.tld  testcluster2
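Both DRBD's on sections and heartbeat's node directives refer to the nodes by hostname as reported by uname -n, so it is worth verifying now that each box really reports the name you are going to use (on Gentoo the hostname is normally set in /etc/conf.d/hostname):
testcluster1 / # uname -n
testcluster1
testcluster2 / # uname -n
testcluster2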
Installing and configuring DRBD
Preparing your HD for DRBD
If you want to use DRBD for mirroring, you should create an extra partition to hold the data you want to mirror to the other node (e.g. /var/lib/postgresql for PostgreSQL or /var/www for Apache). In addition to the mirrored data, DRBD needs at least 128MB to store its meta-data. For example, here's an additional virtual disk with two partitions on it, one for Apache and the other for MySQL.
Code: Partition table for DRBD |
testcluster1 / # fdisk /dev/sdb

Command (m for help): p

Disk /dev/sdb: 2147 MB, 2147483648 bytes
255 heads, 63 sectors/track, 261 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1         131     1052226   83  Linux
/dev/sdb2             132         261     1044225   83  Linux |
Note: The size can be specified in fdisk with +128M when asked for the ending sector. To specify an exact partition size, switch fdisk's units to sectors with the "u" command before creating the partitions as described in the Gentoo handbook.
Note: You can make an exact copy of a partition table by using sfdisk, which dumps the table in a format it can also read back:
sfdisk -d /dev/sda
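For example, a minimal sketch of copying the layout of the DRBD disk from testcluster1 to testcluster2 (this assumes /dev/sdb is the disk on both nodes, that ssh access between the nodes is available, and that /root/sdb.layout is just an arbitrary file name):
testcluster1 / # sfdisk -d /dev/sdb > /root/sdb.layout
testcluster1 / # scp /root/sdb.layout testcluster2:/root/
testcluster2 / # sfdisk /dev/sdb < /root/sdb.layout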
Kernel Configuration
Activate the following options:
Linux Kernel Configuration: Support for bindings |
Device Drivers  --->
    -- Connector - unified userspace <-> kernelspace linker
Cryptographic API  --->
    -- Cryptographic algorithm manager |
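If you are not sure whether your running kernel already has these enabled, you can grep the kernel configuration; this assumes /usr/src/linux/.config matches the running kernel (if you have /proc/config.gz, zgrep it instead). The options should show up as =y or =m:
testcluster1 / # grep -E 'CONFIG_CONNECTOR|CONFIG_CRYPTO_MANAGER' /usr/src/linux/.config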
Installing, configuring and running DRBD
Please note that you need to do the following on each node. Install DRBD:
testcluster1 / # emerge -av drbd
testcluster2 / # emerge -av drbd
Code: emerge -av drbd |
These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild  N    ] sys-cluster/drbd-kernel-8.0.13  327 kB
[ebuild  N    ] sys-cluster/drbd-8.0.13  0 kB

Total: 2 packages (2 new), Size of downloads: 327 kB |
After you've successfully installed DRBD you'll need to create the configuration file. The following is the complete configuration. Note that "testcluster1" and "testcluster2" in the on sections must match the hostnames of your boxes, and don't forget to copy the finished file to both nodes.
Fix me: There is a lot of redundancy that might be better in the common section. Can someone take a look and optimize the config? |
File: /etc/drbd.conf
global { usage-count no; }

common { }

#
# this need not be r#, you may use phony resource names,
# like "resource web" or "resource mail", too
#
resource "drbd0" {

  # transfer protocol to use.
  # C: write IO is reported as completed, if we know it has
  #    reached _both_ local and remote DISK.
  #    * for critical transactional data.
  # B: write IO is reported as completed, if it has reached
  #    local DISK and remote buffer cache.
  #    * for most cases.
  # A: write IO is reported as completed, if it has reached
  #    local DISK and local tcp send buffer. (see also sndbuf-size)
  #    * for high latency networks
  #
  protocol C;

  handlers {
    # what should be done in case the cluster starts up in
    # degraded mode, but knows it has inconsistent data.
    #pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
    pri-on-incon-degr "echo 'DRBD: primary requested but inconsistent!' | wall; /etc/init.d/heartbeat stop"; #"halt -f";
    pri-lost-after-sb "echo 'DRBD: primary requested but lost!' | wall; /etc/init.d/heartbeat stop"; #"halt -f";
    #pri-on-incon-degr "echo o > /proc/sysrq-trigger";
    #pri-lost-after-sb "echo o > /proc/sysrq-trigger";
    #local-io-error    "echo o > /proc/sysrq-trigger";
  }

  startup {
    # The init script drbd(8) blocks the boot process until the DRBD resources are connected.
    # When the cluster manager starts later, it does not see a resource with internal split-brain.
    # In case you want to limit the wait time, do it here.
    # Default is 0, which means unlimited. The unit is seconds.
    wfc-timeout 0;

    # Wait for connection timeout if this node was a degraded cluster.
    # In case a degraded cluster (= cluster with only one node left)
    # is rebooted, this timeout value is used.
    # degr-wfc-timeout 120;    # 2 minutes.
  }

  syncer {
    rate 100M;
    # This is now expressed with "after res-name"
    #group 1;
    al-extents 257;
  }

  net {
    # TODO: Should these timeouts be relative to some heartbeat settings?
    # timeout       60;    #  6 seconds (unit = 0.1 seconds)
    # connect-int   10;    # 10 seconds (unit = 1 second)
    # ping-int      10;    # 10 seconds (unit = 1 second)

    # if the connection to the peer is lost you have the choice of
    #  "reconnect"   -> Try to reconnect (AKA WFConnection state)
    #  "stand_alone" -> Do not reconnect (AKA StandAlone state)
    #  "freeze_io"   -> Try to reconnect but freeze all IO until
    #                   the connection is established again.
    # FIXME This appears to be obsolete
    #on-disconnect reconnect;

    # FIXME Experimental
    #cram-hmac-alg "sha256";
    #shared-secret "secretPassword555";
    #after-sb-0pri discard-younger-primary;
    #after-sb-1pri consensus;
    #after-sb-2pri disconnect;
    #rr-conflict disconnect;
  }

  disk {
    # if the lower level device reports io-error you have the choice of
    #  "pass_on" -> Report the io-error to the upper layers.
    #               Primary   -> report it to the mounted file system.
    #               Secondary -> ignore it.
    #  "panic"   -> The node leaves the cluster by doing a kernel panic.
    #  "detach"  -> The node drops its backing storage device, and
    #               continues in diskless mode.
    #
    on-io-error pass_on;

    # Fencing means preventive measures to avoid situations where both nodes are
    # primary and disconnected (AKA split brain).
    fencing dont-care;

    # In case you only want to use a fraction of the available space
    # you might use the "size" option here.
    #
    # size 10G;
  }

  on testcluster1 {
    device     /dev/drbd0;
    disk       /dev/sdb1;
    address    10.0.0.1:7788;
    meta-disk  internal;
  }

  on testcluster2 {
    device     /dev/drbd0;
    disk       /dev/sdb1;
    address    10.0.0.2:7788;
    meta-disk  internal;
  }
}

#
# second resource: same protocol, handlers, startup, syncer, net and disk
# settings as resource "drbd0" above -- see the comments there.
#
resource "drbd1" {

  protocol C;

  handlers {
    pri-on-incon-degr "echo 'DRBD: primary requested but inconsistent!' | wall; /etc/init.d/heartbeat stop"; #"halt -f";
    pri-lost-after-sb "echo 'DRBD: primary requested but lost!' | wall; /etc/init.d/heartbeat stop"; #"halt -f";
  }

  startup {
    wfc-timeout 0;
    # degr-wfc-timeout 120;    # 2 minutes.
  }

  syncer {
    rate 100M;
    al-extents 257;
  }

  disk {
    on-io-error pass_on;
    fencing dont-care;
  }

  on testcluster1 {
    device     /dev/drbd1;
    disk       /dev/sdb2;
    address    10.0.0.1:7789;
    meta-disk  internal;
  }

  on testcluster2 {
    device     /dev/drbd1;
    disk       /dev/sdb2;
    address    10.0.0.2:7789;
    meta-disk  internal;
  }
}
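Once /etc/drbd.conf is in place on both nodes, an optional sanity check is to let drbdadm parse it; drbdadm dump re-reads the configuration and prints the resources it found, so typos show up before you touch any devices:
testcluster1 / # drbdadm dump all
testcluster2 / # drbdadm dump all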
Now it's time to set up DRBD. Run the following commands on both nodes.
testcluster1 / # modprobe drbd
testcluster1 / # drbdadm create-md drbd0
testcluster1 / # drbdadm attach drbd0
testcluster1 / # drbdadm connect drbd0
testcluster1 / # drbdadm create-md drbd1
testcluster1 / # drbdadm attach drbd1
testcluster1 / # drbdadm connect drbd1
testcluster2 / # modprobe drbd
testcluster2 / # drbdadm create-md drbd0
testcluster2 / # drbdadm attach drbd0
testcluster2 / # drbdadm connect drbd0
testcluster2 / # drbdadm create-md drbd1
testcluster2 / # drbdadm attach drbd1
testcluster2 / # drbdadm connect drbd1
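Neither node is primary yet at this point. As an optional sanity check, /proc/drbd should now show both resources with the nodes connected, both sides Secondary and the data still marked Inconsistent (the exact output format depends on the DRBD version):
testcluster1 / # cat /proc/drbd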
Now on the primary node run the following to synchronize both drbd disks:
testcluster1 / # drbdadm -- --overwrite-data-of-peer primary drbd0
testcluster1 / # drbdadm -- --overwrite-data-of-peer primary drbd1
At this point a full synchronization should be occurring. Depending on your hardware and the size of the partitions, this can take some time; once everything is synced, the ongoing mirroring is very fast. See DRBD-Performance for more information. You can monitor the progress with the following command:
testcluster1 / # watch cat /proc/drbd
Code: monitor the synchronization progress |
version: 8.0.11 (api:86/proto:86)
GIT-hash: b3fe2bdfd3b9f7c2f923186883eb9e2a0d3a5b1b build by root@localhost, 2008-04-18 11:35:09
 0: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
    ns:2176 nr:0 dw:49792 dr:2196 al:17 bm:0 lo:0 pe:5 ua:0 ap:0
        [>....................] sync'ed:  0.4% (1050136/1052152)K
        finish: 0:43:45 speed: 336 (336) K/sec
        resync: used:0/31 hits:130 misses:1 starving:0 dirty:0 changed:1
        act_log: used:0/257 hits:12431 misses:35 starving:0 dirty:18 changed:17
 1: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
    ns:163304 nr:0 dw:33252 dr:130072 al:13 bm:8 lo:0 pe:5 ua:0 ap:0
        [=>..................] sync'ed: 14.6% (893640/1044156)K
        finish: 0:49:38 speed: 296 (360) K/sec
        resync: used:0/31 hits:13317 misses:16 starving:0 dirty:0 changed:16
        act_log: used:0/257 hits:8300 misses:13 starving:0 dirty:0 changed:13 |
You can use /dev/drbd0 and /dev/drbd1 as normal disks even before syncing has finished, so let's go ahead and format them. Use whatever filesystem you want; in this example we use ext3. Do this on the first node only.
Formatting the disks:
testcluster1 / # mke2fs -j /dev/drbd0
testcluster1 / # mke2fs -j /dev/drbd1
Now set up the primary and secondary nodes. Notice these commands are different on each node.
testcluster1 / # drbdadm primary all
testcluster2 / # drbdadm secondary all
You can swap the roles again to verify that it syncs in both directions; if you're quick you can issue these commands within a few seconds, but you'll only find that DRBD was faster ;-) Make sure you add the mount points to the fstab and that they are set to noauto. Again, this needs to be done on both nodes.
File: /etc/fstab
## NOTE: If your BOOT partition is ReiserFS, add the notail option to opts.
/dev/sda1      /boot              ext2    noatime              1 2
/dev/sda3      /                  ext3    noatime              0 1
/dev/sda2      none               swap    sw                   0 0
/dev/cdrom     /mnt/cdrom         auto    noauto,ro            0 0
#/dev/fd0      /mnt/floppy        auto    noauto               0 0

/dev/drbd0     /wwwjail/siteroot  ext3    noauto               0 0
/dev/drbd1     /wwwjail/mysql     ext3    noauto               0 0

proc           /proc              proc    defaults             0 0

# glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
# POSIX shared memory (shm_open, shm_unlink).
# (tmpfs is a dynamically expandable/shrinkable ramdisk, and will
#  use almost no memory if not populated with files)
shm            /dev/shm           tmpfs   nodev,nosuid,noexec  0 0
Time to create the mount points, again on both nodes.
testcluster1 / # mkdir -p /wwwjail/siteroot
testcluster1 / # mkdir -p /wwwjail/mysql
testcluster2 / # mkdir -p /wwwjail/siteroot
testcluster2 / # mkdir -p /wwwjail/mysql
You can now mount them on the first node:
testcluster1 / # mount /wwwjail/siteroot
testcluster1 / # mount /wwwjail/mysql
Note: It is possible to mount the DRBD partition on both nodes at once, but that is not covered in this article. For more information see [1].
MySQL should already be installed, but we need to configure it to use the DRBD device. We do that by simply putting all the databases and logs in /wwwjail/mysql. In a production environment you'd probably break out logs, data and index files onto different devices; since this is an experimental system, we'll just put everything into one resource. Make sure no bind-address is set, because we need to bind to all interfaces and then limit access with iptables if need be. This needs to go on both nodes.
File: /etc/mysql/my.cnf
...
#bind-address = 127.0.0.1
...
#datadir = /var/lib/mysql
datadir = /wwwjail/mysql
...
Now we need to install the MySQL database onto the shared device. Issue the following command on both nodes.
testcluster1 / # mysql_install_db
testcluster2 / # mysql_install_db
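Depending on how mysql_install_db was run, the files it creates may end up owned by root. A small sketch of making sure the MySQL user owns the data directory on whichever node currently has it mounted (this assumes the default mysql user and group):
testcluster1 / # chown -R mysql:mysql /wwwjail/mysql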
OK, if everything has gone well to this point, you can add DRBD to the startup items on both nodes by adding it to the default runlevel.
testcluster1 / # rc-update add drbd default
testcluster2 / # rc-update add drbd default
Warning: Do not add drbd to your kernel's auto loaded modules! It *will* cause issues with heartbeat.
Note: It is unclear if you need to wait for the sync to finish at this point, but it might be a good idea anyway.
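One way to tell whether the initial sync has completed is to look at /proc/drbd again; once both resources report ds:UpToDate/UpToDate, the devices are fully synchronized. For example:
testcluster1 / # grep ds: /proc/drbd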
After syncing (unless you're brave) you should be able to start DRBD normally:
testcluster1 / # /etc/init.d/drbd start
testcluster2 / # /etc/init.d/drbd start
The DRBD service should automatically load the drbd kernel module:
testcluster1 etc # lsmod
Code: kernel modules listing |
Module                  Size  Used by
drbd                  142176  1 |
If you reboot a node during testing, you will have to issue
# drbdadm primary all
on the node where you want the data to be retrieved from, as DRBD does not remember the roles (to DRBD, both nodes are equal). We will automate this with heartbeat later. If you have different data on each node, you may have a split brain. To fix this, run the following commands. Note that this assumes testcluster1 is more up to date than testcluster2; if the opposite is true, swap the commands between the nodes.
testcluster1 / # drbdadm connect all
testcluster2 / # drbdadm -- --discard-my-data connect all
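After the discarding side reconnects, DRBD resynchronizes it from the surviving data set. You can verify that the nodes are talking to each other again by checking the connection state, which should return to Connected once the resync finishes:
testcluster1 / # grep cs: /proc/drbd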
Installing and Configuring Heartbeat
Heartbeat is based on init scripts, so setting it up to do advanced things is not that difficult, but that's not covered in this document. Again, most of the steps here need to be run on both nodes. Go ahead and emerge heartbeat.
emerge -av heartbeat
Code: emerging heartbeat |
These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild  N    ] sys-cluster/heartbeat-2.0.7-r2  USE="-doc -ldirectord -management -snmp" 3,250 kB

Total: 1 package (1 new), Size of downloads: 3,250 kB |
All of the heartbeat configuration lives in /etc/ha.d/. Again, most of the important config files are not installed by default, so you will need to create them.
File: /etc/ha.d/ha.cf
# What interfaces to heartbeat over?
#udp eth1
bcast eth1

# keepalive: how many seconds between heartbeats
keepalive 2

# Time in seconds before issuing a "late heartbeat" warning in the logs.
warntime 10

# Node is pronounced dead after this many seconds.
deadtime 15

# With some configurations, the network takes some time to start working after a reboot.
# This is a separate "deadtime" to handle that case.
# It should be at least twice the normal deadtime.
initdead 30

# Mandatory. Hostname of machine in cluster as described by uname -n.
node testcluster1
node testcluster2

# When auto_failback is set to on, once the master comes back online it will take
# everything back from the slave.
auto_failback off

# Some default uid, gid info. This is required for ipfail.
apiauth default uid=nobody gid=cluster
apiauth ipfail uid=cluster
apiauth ping gid=nobody uid=nobody,cluster

# This is to fail over if the outbound network connection goes down.
respawn cluster /usr/lib/heartbeat/ipfail

# IP to ping to check to see if the external connection is up.
ping 192.168.0.1
deadping 15

debugfile /var/log/ha-debug

# File to write other messages to
logfile /var/log/ha-log

# Facility to use for syslog()/logger
logfacility local0
The haresources file is probably the most important file to configure. It lists which init scripts need to be run and the parameters to pass to them; scripts are looked up in /etc/ha.d/resource.d/ first and then in /etc/init.d. Please note the init scripts need to follow the Linux Standard Base Core Specification, specifically regarding the function return codes. For example, IPaddr::192.168.0.100 will run the /etc/ha.d/resource.d/IPaddr script, which creates an IP alias on eth0 with the IP address 192.168.0.100.
Fix me: Can anyone lookup if this is the correct format for this file? |
File: both nodes: /etc/ha.d/haresources
testcluster1 IPaddr::192.168.0.100
testcluster1 drbddisk::drbd0 Filesystem::/dev/drbd0::/wwwjail/siteroot::ext3::noatime apache2
testcluster1 drbddisk::drbd1 Filesystem::/dev/drbd1::/wwwjail/mysql::ext3::noatime mysql
Warning: The contents of the haresources file must be exactly the same on both nodes!
drbddisk will run drbdadm primary drbd0, and Filesystem is basically just a mount. The last file tells heartbeat how to authenticate its communication with the other node. Because this example is based on a (simulated) crossover cable connecting the two nodes, we will just use a crc check.
File: both nodes: /etc/ha.d/authkeys
auth 1
1 crc
If you plan on sending the heartbeat across a shared network, you should use something a little stronger than crc. The following is the configuration for sha1.
File: both nodes: /etc/ha.d/authkeys
auth 1
1 sha1 MySharedSecretPassword
Finally, because the /etc/ha.d/authkeys file may contain a plain-text password, set restrictive permissions on it on both nodes.
testcluster1 / # chown root:root /etc/ha.d/authkeys
testcluster1 / # chmod 600 /etc/ha.d/authkeys
testcluster2 / # chown root:root /etc/ha.d/authkeys
testcluster2 / # chmod 600 /etc/ha.d/authkeys
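With the configuration in place, heartbeat still has to be started. A minimal sketch, assuming the standard Gentoo init scripts: add it to the default runlevel and start it on both nodes.
testcluster1 / # rc-update add heartbeat default
testcluster2 / # rc-update add heartbeat default
testcluster1 / # /etc/init.d/heartbeat start
testcluster2 / # /etc/init.d/heartbeat start
To test a failover, you can stop heartbeat on the node that currently holds the resources (for example /etc/init.d/heartbeat stop on testcluster1) and then check with ip addr show eth0 and mount on the other node that the 192.168.0.100 alias, the DRBD devices and the services have moved over.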
References
- http://www.drbd.org/users-guide/s-heartbeat-config.html
- http://www.drbd.org/fileadmin/drbd/doc/8.0.2/en/drbd.conf.html
- http://www.linux-ha.org/HeartbeatTutorials
- http://www.linux-ha.org/GettingStarted
- http://www.linuxjournal.com/article/5862
- http://www.linuxjournal.com/article/9074
- http://cvs.linux-ha.org/viewcvs/viewcvs.cgi/linux-ha/doc/GettingStarted.html?rev=1.29