
Friday, January 22, 2010

A HA Two-Node Server Side Cluster Using Glusterfs and CentOS

SkyHi @ Friday, January 22, 2010
I came across Glusterfs the other day. On the surface it seemed similar to DRBD, but after closer examination I realized it is completely different. After some reading I came to see that it may offer benefits over DRBD and GNBD, and it seemed extremely straightforward to implement, so I decided to test a two-node cluster and run Apache on it.
Network Setup

The logical network setup is basically a two-node cluster using the server-side replication capability (as opposed to client-side replication). In this fashion the client(s) that mount the exported volumes only need to worry about serving the data through an application, Apache in this case.

HA is achieved by using round-robin DNS (RRDNS) from the client to the servers: the client issues requests to the servers using a FQDN for the cluster, and if one of the server nodes is down it switches to the other node. When the failed node comes back, the self-healing properties of Glusterfs ensure that data is replicated from the remaining node.

The nodes will be configured as follows:
    * Node 0: 10.0.0.1 (client)
    * Node 1: 10.0.0.2
    * Node 2: 10.0.0.3
    * Node 1 and node 2 will have /ha exported
    * The client will mount /ha
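For reference, a minimal /etc/hosts sketch for the three nodes (the hostnames are assumptions; the RRDNS section below uses the mycluster.com names):

10.0.0.1   node0.mycluster.com   node0
10.0.0.2   node1.mycluster.com   node1
10.0.0.3   node2.mycluster.com   node2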
For my example I only have one client, but this setup is easily extendable to two or more clients as well as two or more servers. My setup is of course nothing new; it is based on several examples from the Glusterfs-Wiki, a couple of which I redid to suit my test needs. In particular you should look at:
    * High-availability storage using server-side AFR by Daniel Maher
    * Setting up AFR on two servers with server side replication by Brandon Lamb
Prep Work

Glusterfs has some dependencies:
   1. Infiniband support if you use an infiniband network
   2. Berkeley DB support
   3. Extended attributes support for the backend system (exported filesystem) ext3 in our case.
   4. FUSE is the primary requirement for Glusterfs.
I will be using CentOS 5.2; the first three requirements come standard with it, but FUSE does not, so you need to install it.

Install FUSE

Fuse is not available from the standard repository so you need to get it from RPMFORGE.

Follow the instructions and install support for the repository and YUM. Make sure to also configure YUM to use the priorities plug-in. This makes sure that the standard repositories are used before the RPMFORGE repositories if you use automatic updates or if you want to install an update for a particular package without breaking anything on your system.
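As a rough sketch of that setup (the exact rpmforge-release version and URL change over time, so treat them as placeholders):

# rpm -Uvh http://pkgs.repoforge.org/rpmforge-release/rpmforge-release-0.5.3-1.el5.rf.i386.rpm
# yum -y install yum-priorities
# grep enabled /etc/yum/pluginconf.d/priorities.conf     (should show enabled = 1)

Then add priority=1 to the base/updates sections in /etc/yum.repos.d/CentOS-Base.repo and priority=2 to /etc/yum.repos.d/rpmforge.repo so the standard repositories win.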

When you have the repository support installed, issue the following at a command prompt:
# yum -y install fuse fuse-devel
The command above will install fuse and its libraries, plus any other packages it needs.

Install the FUSE Kernel Module

Make sure you have the kernel-devel package for your kernel. At a prompt issue:
# yum info kernel-devel
You should see something like this:

Installed Packages
Name                 : kernel-devel
Arch                 : i686
Version         : 2.6.18
Release         : 92.el5
Size                 : 15 M
Repo                 : installed
Summary         : Development package for building kernel modules to match the kernel.
Description         : This package provides kernel headers and makefiles sufficient to build
                : modules against the kernel package.
If it says "installed" then you are OK; otherwise you need to install it. Issue:
yum -y install kernel-devel-2.6.18-92.el5 kernel-headers-2.6.18-92.el5 dkms-fuse
The command above installs the kernel development files and the source for the FUSE kernel module.
Change directories to /usr/src/fuse-2.7.4-1.nodist.rf and issue: ./configure; make install;
This will install the fuse.ko kernel module. Finally, run chkconfig fuse on to enable FUSE support at boot up, then start the fuse service: service fuse start.
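A quick sanity check that FUSE is actually available (a sketch; the output will vary):

# modprobe fuse
# lsmod | grep fuse
# chkconfig --list fuse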

Repeat the above procedure on all 3 nodes (the client and the 2 servers).

Install Glusterfs

Download the latest release here (1.3.9). If you are using a 64-bit architecture get the corresponding RPMs; otherwise you will have to compile the RPMs from the SRPM using the following command: rpmbuild --rebuild glusterfs-1.3.9-1.src.rpm. This will create the following in /usr/src/redhat/RPMS/i386:
    * glusterfs-1.3.9-1.i386.rpm
    * glusterfs-devel-1.3.9-1.i386.rpm
    * glusterfs-debuginfo-1.3.9-1.i386.rpm
Copy the files to the other nodes and install the RPMs with: rpm -ivh glusterfs*. Verify the installation was successful by issuing:
# glusterfs --version
No errors should be reported.

Round-Robin DNS

A key component of the HA setup is RRDNS. Though it is used only in one instance, it is a critical function - one which helps to ensure that the data can be served continuously even in the event that one of the storage servers becomes inaccessible.

Normally in a standard configuration a client will access the servers via their IP addresses. The major drawback of this setup is that if a server becomes inaccessible the client will be unable to access the data. This can be mitigated by using a hostname rather than addresses to access the servers.

Consider the following:
$ host node1.mycluster.com
node1.mycluster.com has address 10.0.0.2
$ host node2.mycluster.com
node2.mycluster.com has address 10.0.0.3
$ host cluster.mycluster.com
cluster.mycluster.com has address 10.0.0.2
cluster.mycluster.com has address 10.0.0.3
$ dig cluster.mycluster.com | grep -A 2 "ANSWER SECTION"
;; ANSWER SECTION:
cluster.mycluster.com. 3600 IN A 10.0.0.2
cluster.mycluster.com. 3600 IN A 10.0.0.3
So you need to configure the zone file for the mycluster.com zone to serve the corresponding records for all the server nodes and for the FQDN of the cluster. Configuration and setup of such a DNS zone is well documented on the Internet, so it is left as an exercise for the reader.
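For illustration only, the relevant records in such a zone might look like this (assuming a standard BIND zone file for mycluster.com; adjust names and TTLs to your environment):

node1    IN A 10.0.0.2
node2    IN A 10.0.0.3
cluster  IN A 10.0.0.2
cluster  IN A 10.0.0.3

With both A records present for cluster, the resolver hands them out in round-robin order.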
Configure Glusterfs

Now that all the prep work, RPMs and RRDNS are in place, we are ready to configure the cluster. The key piece of the setup is the "AFR translator"; this is the mechanism that ensures data ("subvolumes") is replicated between servers.

The reader is encouraged to visit the Gluster-Wiki and go over the fundamentals of Glusterfs and in particular over the performance options used in this setup (readahead, writeback, cache-size, etc).

Node 1

The following is the configuration of node 1:
[root@node1 ~]# more /etc/glusterfs/glusterfs-server.vol
# Dataspace on Node1
volume gfs-ds
type storage/posix
option directory /ha
end-volume
# posix locks
volume gfs-ds-locks
type features/posix-locks
subvolumes gfs-ds
end-volume
# Dataspace on Node2
volume gfs-node2-ds
type protocol/client
option transport-type tcp/client
option remote-host 10.0.0.3 # IP address of node2
option remote-subvolume gfs-ds-locks
option transport-timeout 5 # value in seconds
end-volume
# automatic file replication translator for dataspace
volume gfs-ds-afr
type cluster/afr
subvolumes gfs-ds-locks gfs-node2-ds
end-volume
# the actual exported volume
volume gfs
type performance/io-threads
option thread-count 8
option cache-size 64MB
subvolumes gfs-ds-afr
end-volume
volume server
type protocol/server
option transport-type tcp/server
subvolumes gfs
option auth.ip.gfs-ds-locks.allow 10.0.0.*,127.0.0.1
option auth.ip.gfs.allow 10.0.0.*,127.0.0.1
end-volume
Node 2

The configuration of node 2:
[root@node2 ~]# more /etc/glusterfs/glusterfs-server.vol
# Dataspace on Node2
volume gfs-ds
type storage/posix
option directory /ha
end-volume
# posix locks
volume gfs-ds-locks
type features/posix-locks
subvolumes gfs-ds
end-volume
# Dataspace on Node1
volume gfs-node1-ds
type protocol/client
option transport-type tcp/client
option remote-host 10.0.0.2 # IP address of node1
option remote-subvolume gfs-ds-locks
option transport-timeout 5 # value in seconds
end-volume
# automatic file replication translator for dataspace
volume gfs-ds-afr
type cluster/afr
subvolumes gfs-ds-locks gfs-node1-ds
end-volume
# the actual exported volume
volume gfs
type performance/io-threads
option thread-count 8
option cache-size 64MB
subvolumes gfs-ds-afr
end-volume
volume server
type protocol/server
option transport-type tcp/server
subvolumes gfs
option auth.ip.gfs-ds-locks.allow 10.0.0.*,127.0.0.1
option auth.ip.gfs.allow 10.0.0.*,127.0.0.1
end-volume
Client

Finally the client configuration:
[root@node0 ~]# more /etc/glusterfs/glusterfs-client.vol
# the exported volume to mount
volume cluster
type protocol/client
option transport-type tcp/client # For TCP/IP transport
option remote-host cluster.yourdomain.com # FQDN of server
option remote-subvolume gfs # Exported volume
option transport-timeout 10 # Value in seconds
end-volume
# performance block for cluster # optional!
volume writeback
type performance/write-behind
option aggregate-size 131072
subvolumes cluster
end-volume
# performance block for cluster # optional!
volume readahead
type performance/read-ahead
option page-size 65536
option page-count 16
subvolumes writeback
end-volume
Start Gluster on Servers and Clients

On both servers make sure:
    * /ha exists
    * If /ha is a mount point, make sure the file system has been created (in our case ext3)
    * /ha is mounted on both servers
    * The configuration files exist
On the client make sure:
    * The fuse service is started and the kernel module is loaded
    * The client configuration exists
Make sure the FQDN cluster.mydomain.com resolves to both addresses (10.0.0.2, 10.0.0.3). Finally, make sure the clock is synchronized on all three nodes.
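A quick pre-flight check along those lines might look like this (a sketch; the hostname and NTP server are placeholders):

# host cluster.mydomain.com        (must return both 10.0.0.2 and 10.0.0.3)
# ls /etc/glusterfs/               (server .vol file on the servers, client .vol file on the client)
# lsmod | grep fuse                (on the client)
# ntpdate -q pool.ntp.org          (clock check on all three nodes)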

Start gluster on both servers:
[root@node2 ~]# glusterfsd -f /etc/glusterfs/glusterfs-server.vol
[root@node2 ~]# tail /var/log/glusterfsd.log
Start Gluster on the client and mount /ha:
[root@node0 ~]# glusterfs -f /etc/glusterfs/glusterfs-client.vol /ha
[root@node0 ~]# cd /ha
[root@node0 ~]# ls -l
Test that things are working ok by creating a new file:
[root@node0 ~]# cd /ha
[root@node0 ~]# touch file.txt
[root@node0 ~]# ls
file.txt
Now go to both servers and check whether the file was created under /ha on each. If the file exists on both, congratulations!
Configure Apache

The configuration of Apache is identical to the one we did for our two-node GNBD cluster. Copy all the files to the new root, which will be /ha, then on the client (node 0) change the configuration file to point the web server at the correct document root (i.e. /ha).
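For example, the relevant httpd.conf lines might look like this (Apache 2.2 syntax as shipped with CentOS 5; a sketch, not the exact config used here):

DocumentRoot "/ha"
<Directory "/ha">
    Options Indexes FollowSymLinks
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>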

When done you should be able to go to the IP of the client via a web browser and get the Apache default page.
Testing Failover

To test failover, simulate a failure of one of the nodes; for our test we will choose node 1. So issue:
[root@node1 ~]# killall glusterfsd
Wait a few seconds and then browse to the IP of node 0. You should still get a response; if you do, congratulations once more! You have a working two-node cluster.

Test recovery by restarting gluster on node 1; before you do, create a file on the client:
[root@node0 ~]# echo "testing failure" > /ha/test.txt
The file should appear on node 2, but node 1 will not have it. Start gluster on node 1; after it starts you will notice that the file "test.txt" still does not appear in /ha on node 1 even though the cluster is once again up.

Self-healing will eventually synchronize the files from node 2 to node 1 as the new files are accessed. If you want to force self-healing to happen, use a script that accesses the files that have changed, or simply do a:
[root@node2 ~]# ls -lR
The above will force self-healing; after you do this, "test.txt" will appear on node 1.
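If you prefer a script over a manual listing, the same effect can be had by walking the tree and reading the first byte of every file (a sketch; run it in the same place you would run the ls -lR above):

# find /ha -type f -exec head -c1 {} \; > /dev/null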
Conclusions

The setup presented here gives you a two-node cluster using server-to-server replication and a client that accesses the data. The setup can easily scale up to two or more servers and two or more clients.

Of course in a production environment you should:

    * Use bonded interfaces on all nodes (servers and clients).
    * Use a dedicated network for cluster communication; see the article by Daniel Maher referenced at the beginning of this post.
    * Use 64-bit servers and dedicated storage for each. This will improve performance.
    * Gluster recommends using client-side replication instead of server-side replication; however, I believe there are advantages to using server-side replication and freeing the client from doing anything other than what a client does: access data.
    * Release 4.0 has added HA as a translator, so IP addresses can now also be used as elements for failover. With this new translator, having an internal DNS to resolve the FQDN of the cluster becomes a non-issue, since you can use the HA IP address of the cluster instead.
    * Use CARP or Heartbeat on the client side to give you additional HA there.

Finally, a setup like this gives you an inexpensive way of creating a cluster without, for example, the cost of a SAN. The Gluster-Wiki has some examples of people using it in production environments.

http://blog.miguelsarmiento.com/ ... lusterfs-and-centos

REFERENCE
http://www.thismail.org/bbs/viewthread.php?tid=3180

HA cluster with DRBD and Heartbeat

SkyHi @ Friday, January 22, 2010
This article shows how to setup a OpenVZ high availability (HA) cluster using the data replication software DRBD and the cluster manager Heartbeat. In this example the two machines building the cluster run on CentOS 4.3. The article also shows how to do kernel updates in the cluster, including necessary steps like recompiling of new DRBD userspace tools. For this purpose, kernel 2.6.8-022stab078.10 (containing DRBD module 0.7.17) is used as initial kernel version, and kernel 2.6.8-022stab078.14 (containing DRBD module 0.7.20) as updated kernel version.
Update: this howto currently does not describe details on OpenVZ Kernel 2.6.18, which contains DRBD version 8.*. Meanwhile, some hints on using OpenVZ Kernel 2.6.18 with DRBD 8 can be found in this thread in the forum.
Additional information about clustering of virtual machines can be found in the following paper: (PDF, 145K)
Some other additional information can be found in the documentation of the Thomas-Krenn.AG cluster (The author of this howto is working in the cluster development there, that is the reason why he was able to write this howto :-). The full documentation with interesting illustrations is currently only available in German:
An excellent presentation and overview by Werner Fischer, Thomas-Krenn.AG is available here http://www.profoss.eu/index.php/main/content/download/355/3864/file/werner-fischer.pdf.



Prerequisites

The OpenVZ kernel already includes the DRBD module. The DRBD userspace tools and the cluster manager Heartbeat must be provided separately. As the API version of the DRBD userspace tools must exactly match the API version of the module, compile them yourself. Also compile Heartbeat yourself, as at the time of this writing the CentOS extras repository only contained an old CVS version of Heartbeat.
On a hardware node for production use there should not be any application that is not really needed for running OpenVZ (anything not needed by OpenVZ should run inside a VE for security reasons). As a result, compile DRBD and Heartbeat on another machine running CentOS 4.3 (in this example I used a virtual machine on a VMware Server).

Compiling Heartbeat

Heartbeat version 1.2.* has successfully been used in a lot of two-node-clusters around the world. As the codebase used in version 1.2.* is in production use for many years now, the code is very stable. At the time of writing, Heartbeat version 1.2.4 is the current version of the 1.2.* branch.
Get the tar.gz of the current version of the 1.2.* branch from http://linux-ha.org/download/index.html, at the time of this writing this is http://linux-ha.org/download/heartbeat-1.2.4.tar.gz. Use rpmbuild to build the package:
rpmbuild -ta heartbeat-1.2.4.tar.gz
After that, you find four rpm packages in /usr/src/redhat/RPMS/i386 (heartbeat-1.2.4-1.i386.rpm, heartbeat-ldirectord-1.2.4-1.i386.rpm, heartbeat-pils-1.2.4-1.i386.rpm, heartbeat-stonith-1.2.4-1.i386.rpm). In this example only heartbeat-1.2.4-1.i386.rpm, heartbeat-pils-1.2.4-1.i386.rpm, and heartbeat-stonith-1.2.4-1.i386.rpm are needed.

Compiling DRBD userspace tools

When compiling the DRBD userspace tools, you have to take care to use the version that matches the DRBD version included in the OpenVZ kernel you want to use. If you are unsure about the version, do the following steps while running the OpenVZ kernel that you want to use on a test machine (I used another virtual machine on a VMware server to try this):
[root@testmachine ~]# cat /proc/version
Linux version 2.6.8-022stab078.10 (root@rhel4-32) (gcc version 3.4.5 20051201 (Red Hat 3.4.5-2)) #1 Wed Jun 21 12:01:20 MSD 2006
[root@testmachine ~]# modprobe drbd
[root@testmachine ~]# cat /proc/drbd
version: 0.7.17 (api:77/proto:74)
SVN Revision: 2093 build by phil@mescal, 2006-03-06 15:04:12
 0: cs:Unconfigured
 1: cs:Unconfigured
[root@testmachine ~]# rmmod drbd
[root@testmachine ~]#
Here the version of the DRBD module is 0.7.17, so the userspace tools for 0.7.17 are necessary.
Back on the buildmachine, do the following to create the rpm:
[root@buildmachine ~]# yum install kernel-devel gcc bison flex
Setting up Install Process
Setting up repositories
Reading repository metadata in from local files
Parsing package install arguments
Nothing to do
[root@buildmachine ~]# tar xfz drbd-0.7.17.tar.gz
[root@buildmachine ~]# cd drbd-0.7.17
[root@buildmachine drbd-0.7.17]# make rpm
[...]
You have now:
-rw-r--r--  1 root root 288728 Jul 30 10:40 dist/RPMS/i386/drbd-0.7.17-1.i386.rpm
-rw-r--r--  1 root root 518369 Jul 30 10:40 dist/RPMS/i386/drbd-km-2.6.9_34.0.2.EL-0.7.17-1.i386.rpm
[root@buildmachine drbd-0.7.17]#
Note that in this way the kernel-devel from CentOS is used, but this does not matter as the created drbd-km rpm will not be used (the DRBD kernel module is already included in OpenVZ kernel). If the kernel-devel package is not the same version as the kernel package that is currently running, it is possible to execute 'make rpm KDIR=/usr/src/kernels/2.6.9-34.0.2.EL-i686/' to directly point to the kernel sources.

Installing the two nodes

Install the two machines in the same way as you would for a normal OpenVZ installation, but do not create a filesystem for /vz. This filesystem will be created later on top of DRBD.
Example installation configuration

Parameter               node1                               node2
hostname                ovz-node1                           ovz-node2
/ filesystem            hda1, 10 GB                         hda1, 10 GB
swap space              hda2, 2048 MB                       hda2, 2048 MB
public LAN              eth0, 192.168.1.201                 eth0, 192.168.1.202
private LAN             eth1, 192.168.255.1 (Gbit Ethernet) eth1, 192.168.255.2 (Gbit Ethernet)
other install options   no firewall, no SELinux             no firewall, no SELinux

Installing OpenVZ

Get the OpenVZ kernel and utilities and install them on both nodes, as described in quick installation. Update grub configuration to use the OpenVZ kernel by default. Disable starting of OpenVZ on system boot on both nodes (OpenVZ will be started and stopped by Heartbeat):
[root@ovz-node1 ~]# chkconfig vz off
[root@ovz-node1 ~]# 
Then reboot both machines.

Setting up DRBD

On each of the two nodes create a partition that acts as underlying DRBD device. The partitions should have exactly the same size (I created a 10 GB partition hda3 using fdisk on each node for this example). Note that it might be necessary to reboot the machines to re-read the partition table.
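A simple way to confirm the two partitions really are the same size is to compare the block counts reported by the kernel on both nodes (a sketch):

[root@ovz-node1 ~]# grep hda3 /proc/partitions
[root@ovz-node2 ~]# grep hda3 /proc/partitions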
Install the rpm of the DRBD userspace tools on both nodes:
[root@ovz-node1 ~]# rpm -ihv drbd-0.7.17-1.i386.rpm
Preparing...                ########################################### [100%]
   1:drbd                   ########################################### [100%]
[root@ovz-node1 ~]#
Then create the drbd.conf configuration file and copy it to /etc/drbd.conf on both nodes. Below is the example configuration file that is used in this article:
resource r0 {
  protocol C;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    degr-wfc-timeout 120;
  }

  net {
    on-disconnect reconnect;
  }

  disk {
    on-io-error   detach;
  }

  syncer {
    rate 30M;
    group 1;
    al-extents 257;
  }

  on ovz-node1 {
    device     /dev/drbd0;
    disk       /dev/hda3;
    address    192.168.255.1:7788;
    meta-disk  internal;
  }

  on ovz-node2 {
    device     /dev/drbd0;
    disk       /dev/hda3;
    address    192.168.255.2:7788;
    meta-disk  internal;
  }

}
Start DRBD on both nodes:
[root@ovz-node1 ~]# /etc/init.d/drbd start
Starting DRBD resources:    [ d0 s0 n0 ].
[root@ovz-node1 ~]# 
Then check the status of /proc/drbd:
[root@ovz-node1 ~]# cat /proc/drbd
version: 0.7.17 (api:77/proto:74)
SVN Revision: 2093 build by phil@mescal, 2006-03-06 15:04:12
 0: cs:Connected st:Secondary/Secondary ld:Inconsistent
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
[root@ovz-node1 ~]#
Both nodes are now Secondary and Inconsistent. The latter is because the underlying storage is not yet in-sync, and DRBD has no way to know whether you want the initial sync from ovz-node1 to ovz-node2, or ovz-node2 to ovz-node1. As there is no data below it yet, it does not matter.
To start the sync from ovz-node1 to ovz-node2, do the following on ovz-node1:
[root@ovz-node1 ~]# drbdadm -- --do-what-I-say primary all
[root@ovz-node1 ~]# cat /proc/drbd
version: 0.7.17 (api:77/proto:74)
SVN Revision: 2093 build by phil@mescal, 2006-03-06 15:04:12
 0: cs:SyncSource st:Primary/Secondary ld:Consistent
    ns:627252 nr:0 dw:0 dr:629812 al:0 bm:38 lo:640 pe:0 ua:640 ap:0
        [=>..................] sync'ed:  6.6% (8805/9418)M
        finish: 0:04:51 speed: 30,888 (27,268) K/sec
[root@ovz-node1 ~]#
As you see, DRBD syncs with about 30 MB per second, as we told it so in /etc/drbd.conf. On the SyncSource (ovz-node1 in this case) the DRBD device is already useable (although it is syncing in the background).
So you can immediately create the filesystem:
[root@ovz-node1 ~]# mkfs.ext3 /dev/drbd0
[...]
[root@ovz-node1 ~]# 

Copy necessary OpenVZ files to DRBD device

Move the original /vz directory to /vz.orig and recreate the /vz directory to have it as a mount point (do this on both nodes):
[root@ovz-node1 ~]# mv /vz /vz.orig
[root@ovz-node1 ~]# mkdir /vz
[root@ovz-node1 ~]#
Afterwards move the necessary OpenVZ directories (/etc/vz, /etc/sysconfig/vz-scripts, /var/vzquota) and replace them with symbolic links (do this on both nodes):
[root@ovz-node1 ~]# mv /etc/vz /etc/vz.orig
[root@ovz-node1 ~]# mv /etc/sysconfig/vz-scripts /etc/sysconfig/vz-scripts.orig
[root@ovz-node1 ~]# mv /var/vzquota /var/vzquota.orig
[root@ovz-node1 ~]# ln -s /vz/cluster/etc/vz /etc/vz
[root@ovz-node1 ~]# ln -s /vz/cluster/etc/sysconfig/vz-scripts /etc/sysconfig/vz-scripts
[root@ovz-node1 ~]# ln -s /vz/cluster/var/vzquota /var/vzquota
[root@ovz-node1 ~]#
Currently, ovz-node1 is still Primary of /dev/drbd0. You can now mount it and copy the necessary files to it (only on ovz-node1!):
[root@ovz-node1 ~]# mount /dev/drbd0 /vz
[root@ovz-node1 ~]# cp -a /vz.orig/* /vz/
[root@ovz-node1 ~]# mkdir -p /vz/cluster/etc
[root@ovz-node1 ~]# mkdir -p /vz/cluster/etc/sysconfig
[root@ovz-node1 ~]# mkdir -p /vz/cluster/var
[root@ovz-node1 ~]# cp -a /etc/vz.orig /vz/cluster/etc/vz/
[root@ovz-node1 ~]# cp -a /etc/sysconfig/vz-scripts.orig /vz/cluster/etc/sysconfig/vz-scripts
[root@ovz-node1 ~]# cp -a /var/vzquota.orig /vz/cluster/var/vzquota
[root@ovz-node1 ~]# umount /dev/drbd0
[root@ovz-node1 ~]#

Setting up Heartbeat

Install the necessary Heartbeat rpms on both nodes:
[root@ovz-node1 ~]# rpm -ihv heartbeat-1.2.4-1.i386.rpm heartbeat-pils-1.2.4-1.i386.rpm heartbeat-stonith-1.2.4-1.i386.rpm
Preparing...                ########################################### [100%]
   1:heartbeat-pils         ########################################### [ 33%]
   2:heartbeat-stonith      ########################################### [ 67%]
   3:heartbeat              ########################################### [100%]
[root@ovz-node1 ~]#
Create the Heartbeat configuration file ha.cf and copy it to /etc/ha.d/ha.cf on both nodes. Details about this file can be found at http://www.linux-ha.org/ha.cf. Below is an example configuration which uses the two network connections and also a serial connection for heartbeat packets:
# Heartbeat logging configuration
logfacility daemon

# Heartbeat cluster members
node ovz-node1
node ovz-node2

# Heartbeat communication timing
keepalive 1
warntime 10
deadtime 30
initdead 120

# Heartbeat communication paths
udpport 694
ucast eth1 192.168.255.1
ucast eth1 192.168.255.2
ucast eth0 192.168.1.201
ucast eth0 192.168.1.202
baud 19200
serial /dev/ttyS0

# Don't fail back automatically
auto_failback off

# Monitoring of network connection to default gateway
ping 192.168.1.1
respawn hacluster /usr/lib64/heartbeat/ipfail
Create the Heartbeat configuration file authkeys and copy it to /etc/ha.d/authkeys on both nodes. Set the permissions of this file to 600. Details about this file can be found at http://www.linux-ha.org/authkeys. Below is an example:
auth 1
1 sha1 PutYourSuperSecretKeyHere
Create the Heartbeat configuration file haresources and copy it to /etc/ha.d/haresources on both nodes. Details about this file can be found at http://www.linux-ha.org/haresources. Note that it is not necessary to configure IPs for gratuitous arp here. The gratuitous arp is done by OpenVZ itself, through /etc/sysconfig/network-scripts/ifup-venet and /usr/lib/vzctl/scripts/vps-functions. Below is an example for the haresources file:
ovz-node1 drbddisk::r0 Filesystem::/dev/drbd0::/vz::ext3 vz MailTo::youremail@yourdomain.tld
Finally, you can now start heartbeat on both nodes:
[root@ovz-node1 ~]# /etc/init.d/heartbeat start
Starting High-Availability services:
                                                           [  OK  ]
[root@ovz-node1 ~]#

Before going in production: testing, testing, testing, and ...hm... testing!

The installation of the cluster is finished at this point. Before putting the cluster in production it is very important to test the cluster. Because of all the possible different kinds of hardware that you may have, you may encounter problems when a failover is necessary. And as the cluster is about high availability, such problems must be found before the cluster is used for production.
Here is one example: The e1000 driver that is included in kernels < 2.6.12 has a problem when a cable gets unplugged while broadcast packets are still being sent out on that interface. When using broadcast communication in Heartbeat on a crossover link, this fills up the transmit ring buffer on the adapter (the buffer is full after about 8 minutes after the cable got unplugged). Using unicast communication in Heartbeat fixes the problem for example. Details see: http://www.osdl.org/developer_bugzilla/show_bug.cgi?id=699#c22
Without testing you may not be aware of such problems and may face them when the cluster is in production and a failover would be necessary. So test your cluster carefully!
Possible tests can include:
  • power outage test of active node
  • power outage test of passive node
  • network connection outage test of eth0 of active node
  • network connection outage test of eth0 of passive node
  • network connection outage test of crossover network connection
  • ...
As mentioned above, some problems only arise after an outage lasts longer than some minutes. So do the tests also with a duration of >1h for example.
Before you start to test, build a test plan. Some valuable information on that can be found in chapter 3 "Testing a highly available Tivoli Storage Manager cluster environment" of the Redbook IBM Tivoli Storage Manager in a Clustered Environment, see http://www.redbooks.ibm.com/abstracts/sg246679.html. In this chapter it is mentioned that the experience of the authoring team is that the testing phase must be at least two times the total implementation time for the cluster.

Before installing kernel updates: testing again

New OpenVZ kernels often include driver updates. This kernel, for example, includes an update of the e1000 module: http://openvz.org/news/updates/kernel-022stab078.21
To avoid overlooking problems with new components (such as a newer kernel), it is necessary to re-do the tests mentioned above. But as the cluster is already in production, a second cluster (test cluster) with the same hardware as the main cluster is needed. Use this test cluster to test kernel updates or major OS updates for the hardware node before putting them on the production cluster.
I know this is not an easy task, as it is time-consuming and needs additional hardware just for testing. But when really business-critical applications are running on the cluster, it is very good to know that the cluster also works fine with new updates installed on the hardware node. In many cases a dedicated test cluster and the time effort for testing updates may cost too much. If you cannot do such tests of updates, keep in mind that over time (when you must install security updates of the OS or the kernel) you have a cluster that you have not tested in this configuration.
If you need a tested cluster (also with tested kernel updates), you may take a look on this Virtuozzo cluster: http://www.thomas-krenn.com/cluster

How to do OpenVZ kernel updates when they contain a new DRBD version

As mentioned above, it is important to use the correct version of the DRBD userspace tools. When an OpenVZ kernel contains a new DRBD version, it is important that the DRBD API version of the userspace tools matches the API version of the DRBD module that is included in the OpenVZ kernel. The API versions can be found at http://svn.drbd.org/drbd/branches/drbd-0.7/ChangeLog. The best way is to always use the version of the DRBD userspace tools that matches the version of the DRBD module that is included in the OpenVZ kernel.
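A quick way to compare the two versions on a running node is the following (a sketch; the exact output format depends on the DRBD release):

[root@ovz-node1 ~]# modinfo drbd | grep -i version     (module shipped with the OpenVZ kernel)
[root@ovz-node1 ~]# rpm -q drbd                        (installed userspace tools)
[root@ovz-node1 ~]# head -1 /proc/drbd                 (shows the api: number once the module is loaded)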
In this example the initial cluster installation contained OpenVZ kernel 2.6.8-022stab078.10, which contains the DRBD module 0.7.17. The steps below show the update procedure to OpenVZ kernel 2.6.8-022stab078.14, which contains the DRBD module 0.7.20. In the first step build the DRBD userspace tools version 0.7.20 on your buildmachine. Then stop Heartbeat and DRBD on the passive node (hint: you can use 'cat /proc/drbd' to get a hint which node is active and which one is passive):
[root@ovz-node2 ~]# cat /proc/drbd
version: 0.7.17 (api:77/proto:74)
SVN Revision: 2093 build by phil@mescal, 2006-03-06 15:04:12
 0: cs:Connected st:Secondary/Primary ld:Consistent
    ns:60 nr:136 dw:196 dr:97 al:3 bm:3 lo:0 pe:0 ua:0 ap:0
[root@ovz-node2 ~]# /etc/init.d/heartbeat stop
Stopping High-Availability services:
                                                           [  OK  ]
[root@ovz-node2 ~]# cat /proc/drbd
version: 0.7.17 (api:77/proto:74)
SVN Revision: 2093 build by phil@mescal, 2006-03-06 15:04:12
 0: cs:Connected st:Secondary/Primary ld:Consistent
    ns:60 nr:136 dw:196 dr:97 al:3 bm:3 lo:0 pe:0 ua:0 ap:0
[root@ovz-node2 ~]# /etc/init.d/drbd stop
Stopping all DRBD resources.
[root@ovz-node2 ~]# cat /proc/drbd
cat: /proc/drbd: No such file or directory
[root@ovz-node2 ~]#
Then install the new kernel and the DRBD userspace tools on this node:
[root@ovz-node2 ~]# rpm -ihv ovzkernel-2.6.8-022stab078.14.i686.rpm
warning: ovzkernel-2.6.8-022stab078.14.i686.rpm: V3 DSA signature: NOKEY, key ID a7a1d4b6
Preparing...                ########################################### [100%]
   1:ovzkernel              ########################################### [100%]
[root@ovz-node2 ~]# rpm -Uhv drbd-0.7.20-1.i386.rpm
Preparing...                ########################################### [100%]
   1:drbd                   ########################################### [100%]
/sbin/service
Stopping all DRBD resources.
[root@ovz-node2 ~]#
Now set the new kernel as default kernel in /etc/grub.conf and then reboot this node.
After the reboot, the new DRBD version is visible:
[root@ovz-node2 ~]# cat /proc/drbd
version: 0.7.20 (api:79/proto:74)
SVN Revision: 2260 build by phil@mescal, 2006-07-04 15:18:57
 0: cs:Connected st:Secondary/Primary ld:Consistent
    ns:0 nr:28 dw:28 dr:0 al:0 bm:2 lo:0 pe:0 ua:0 ap:0
[root@ovz-node2 ~]#
To update the other node, switch-over the services to make the current active node the passive node. Execute the following on the still active node (it could be that the hb_standby command is located in /usr/lib/heartbeat):
[root@ovz-node1 ~]# /usr/lib64/heartbeat/hb_standby
2006/08/03_21:09:41 Going standby [all].
[root@ovz-node1 ~]#
Now do the same steps on the new passive node to update it: stop Heartbeat and DRBD, install the new kernel and the new DRBD userspace tools, set the new kernel as default kernel in /etc/grub.conf and reboot the node.

How to do updates of vzctl, vzctl-lib, and vzquota

Ensure after every update of OpenVZ tools that OpenVZ is not started on system boot. To disable starting of OpenVZ on system boot execute on both nodes:
[root@ovz-node1 ~]# chkconfig vz off
[root@ovz-node1 ~]# 

Live-Switchover with the help of checkpointing

With the help of checkpointing it is possible to do live switchovers.
Important: although this HOWTO currently describes the use of DRBD 0.7, it is necessary to use DRBD 8 to be able to use this live-switchover feature reliably. Some hints on using OpenVZ Kernel 2.6.18 with DRBD 8 can be found in this thread in the forum.
The following scripts are written by Thomas Kappelmueller. They should be placed at /root/live-switchover/ on both nodes. To activate the scripts execute the following commands on both nodes:
[root@ovz-node1 ~]# ln -s /root/live-switchover/openvz /etc/init.d/
[root@ovz-node1 ~]# ln -s /root/live-switchover/live_switchover.sh /root/bin/
[root@ovz-node1 ~]# 
It is also necessary to replace vz with an adjusted init script (openvz in this example). So /etc/ha.d/haresources has the following content on both nodes:
ovz-node1 drbddisk::r0 Filesystem::/dev/drbd0::/vz::ext3 openvz MailTo::youremail@yourdomain.tld

Script cluster_freeze.sh

#!/bin/bash
#Script by Thomas Kappelmueller
#Version 1.0
LIVESWITCH_PATH='/vz/cluster/liveswitch'

if [ -f $LIVESWITCH_PATH ]
then
        rm -f $LIVESWITCH_PATH
fi

RUNNING_VE=$(vzlist -1)

for I in $RUNNING_VE
do
        BOOTLINE=$(cat /etc/sysconfig/vz-scripts/$I.conf | grep -i "^onboot")
        if [ $I != 1 -a "$BOOTLINE" = "ONBOOT=\"yes\"" ]
        then
                vzctl chkpnt $I

                if [ $? -eq 0 ]
                then
                        vzctl set $I --onboot no --save
                        echo $I >> $LIVESWITCH_PATH
                fi
        fi
done

exit 0

Script cluster_unfreeze.sh

#!/bin/bash
#Script by Thomas Kappelmueller
#Version 1.0

LIVESWITCH_PATH='/vz/cluster/liveswitch'

if [ -f $LIVESWITCH_PATH ]
then
        FROZEN_VE=$(cat $LIVESWITCH_PATH)
else
        exit 1
fi

for I in $FROZEN_VE
do
        vzctl restore $I

        if [ $? != 0 ]
        then
                vzctl start $I
        fi

        vzctl set $I --onboot yes --save
done

rm -f $LIVESWITCH_PATH

exit 0

Script live_switchover.sh

#!/bin/bash
#Script by Thomas Kappelmueller
#Version 1.0

ps -eaf | grep 'vzctl enter' | grep -v 'grep' > /dev/null
if [ $? -eq 0 ]
then
  echo 'vzctl enter is active. please finish before live switchover.'
  exit 1
fi
ps -eaf | grep 'vzctl exec' | grep -v 'grep' > /dev/null
if [ $? -eq 0 ]
then
  echo 'vzctl exec is active. please finish before live switchover.'
  exit 1
fi
echo "Freezing VEs..."
/root/live-switchover/cluster_freeze.sh
echo "Starting Switchover..."
/usr/lib64/heartbeat/hb_standby

Script openvz

#!/bin/bash
#
# openvz        Startup script for OpenVZ
#

start() {
        /etc/init.d/vz start > /dev/null 2>&1
        RETVAL=$?
        /root/live-switchover/cluster_unfreeze.sh
        return $RETVAL
}
stop() {
        /etc/init.d/vz stop > /dev/null 2>&1
        RETVAL=$?
        return $RETVAL
}
status() {
        /etc/init.d/vz status > /dev/null 2>&1
        RETVAL=$?
        return $RETVAL
}

# See how we were called.
case "$1" in
  start)
        start
        ;;
  stop)
        stop
        ;;
  status)
        status
        ;;
  *)
        echo $"Usage: openvz {start|stop|status}"
        exit 1
esac

exit $RETVAL


Heartbeat and DRBD

SkyHi @ Friday, January 22, 2010
Heartbeat and DRBD are high-availability solutions. Heartbeat (part of the Linux-HA project) manages a cluster of servers and makes sure all tasks are worked on, and DRBD (Distributed Replicated Block Device) is the storage analogue, making sure all data is replicated and always online. Together, they help you keep the damage after a hardware or software failure as small as possible.
The drawback of DRBD currently is that you can only read and write on the primary (master) node; the partition you mirror cannot be mounted on the secondary (slave) node. You have to unmount it on the primary node (node1), then tell the DRBD peer on the secondary node (node2) to run as primary, and then mount the partition there to read and write the data.
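As a sketch of that manual role switch (the resource name r0 and mount point /mnt/data are placeholders):

node1 # umount /mnt/data
node1 # drbdadm secondary r0
node2 # drbdadm primary r0
node2 # mount /dev/drbd0 /mnt/data

Heartbeat, described below, automates exactly this sequence during a failover.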


Assumptions and starting configuration

It is assumed you have two identical Gentoo installations. If you are only here for the information and are not setting up two or more physical boxes, you can run these in a VM such as VMware. Both installations have a public static IP address, and the secondary NIC should have some type of private IP address. You will also need an additional public static IP address that will be used as the "service" IP address. Everything relying on the cluster as a whole should use this IP address for services.

System Configuration

Start by tweaking the network devices. It might be that your current configuration already works.
File: testcluster1: /etc/conf.d/net
# External static interface.
config_eth0=( "192.168.0.101 netmask 255.255.255.0 brd 192.168.0.255" )
routes_eth0=( "default gw 192.168.0.1" )
dns_servers_eth0="4.2.2.1 4.2.2.2"
dns_domain_eth0="yourdomain.tld"

# This is the heartbeat and disk syncing interface.
config_eth1=( "10.0.0.1 netmask 255.255.255.0 brd 10.0.0.255" )
File: testcluster2: /etc/conf.d/net
# External static interface.
config_eth0=( "192.168.0.102 netmask 255.255.255.0 brd 192.168.0.255" )
routes_eth0=( "default gw 192.168.0.1" )
dns_servers_eth0="4.2.2.1 4.2.2.2"
dns_domain_eth0="yourdomain.tld"

# This the heartbeat and disk syncing interface.
config_eth1=( "10.0.0.2 netmask 255.255.255.0 brd 10.0.0.255" )
File: both machines: /etc/hosts
# IPv4 and IPv6 localhost aliases
127.0.0.1            localhost.localdomain localhost
192.168.0.100         testcluster.yourdomain.tld testcluster
192.168.0.101         testcluster1.yourdomain.tld testcluster1
192.168.0.102         testcluster2.yourdomain.tld testcluster2

Installing and configuring DRBD

Preparing your HD for DRBD

If you want to use DRBD for mirroring, you should create an extra partition to hold the data you want to mirror to the other node (e.g. /var/lib/postgresql for PostgreSQL or /var/www for Apache). In addition to the mirrored data, DRBD needs at least 128 MB to save its meta-data. For example, here's how to create an additional virtual disk and put 2 partitions on it, one for Apache and the other for MySQL.
Code: Partition table for DRBD
testcluster1 / # fdisk /dev/sdb

Command (m for help): p

Disk /dev/sdb: 2147 MB, 2147483648 bytes
255 heads, 63 sectors/track, 261 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1         131     1052226   83  Linux
/dev/sdb2             132         261     1044225   83  Linux
To specify an exact partition size, you should change the units to sectors by issuing the command "u" (as explained in the command snippet above). Then create the partitions as explained in the Gentoo handbook.
Note: The size can be specified using fdisk with +128M when asked for the ending sector.
Note: You can make an exact copy of a partition table using sfdisk, by dumping it and feeding the dump to sfdisk on the target disk. Example: dumping the partition table with sfdisk
sfdisk -d /dev/sda

Kernel Configuration

Activate the following options:
Linux Kernel Configuration: Support for bindings
Device Drivers --->
 -- Connector - unified userspace <-> kernelspace linker 

Cryptographic API --->
 -- Cryptographic algorithm manager

Installing, configuring and running DRBD

Please note that you need to do the following on each node. Install DRBD:
testcluster1 / # emerge -av drbd
testcluster2 / # emerge -av drbd
Code: emerge -av drbd
These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild  N    ] sys-cluster/drbd-kernel-8.0.13  327 kB
[ebuild  N    ] sys-cluster/drbd-8.0.13  0 kB

Total: 2 packages (2 new), Size of downloads: 327 kB

After you've successfully installed DRBD you'll need to create the configuration file. The following is the complete configuration.
Fix me: There is a lot of redundancy that might be better in the common section. Can someone take a look and optimize the config?
It should be noted that "testcluster1" and "testcluster2" must match the hostname of your boxes.
File: /etc/drbd.conf
global {
        usage-count no;
}

common {
}


#
# this need not be r#, you may use phony resource names,
# like "resource web" or "resource mail", too
#

resource "drbd0" {
        # transfer protocol to use.
        # C: write IO is reported as completed, if we know it has
        #    reached _both_ local and remote DISK.
        #    * for critical transactional data.
        # B: write IO is reported as completed, if it has reached
        #    local DISK and remote buffer cache.
        #    * for most cases.
        # A: write IO is reported as completed, if it has reached
        #    local DISK and local tcp send buffer. (see also sndbuf-size)
        #    * for high latency networks
        #
        protocol C;

        handlers {
                # what should be done in case the cluster starts up in
                # degraded mode, but knows it has inconsistent data.
                #pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

                pri-on-incon-degr "echo 'DRBD: primary requested but inconsistent!' | wall; /etc/init.d/heartbeat stop"; #"halt -f";
                pri-lost-after-sb "echo 'DRBD: primary requested but lost!' | wall; /etc/init.d/heartbeat stop"; #"halt -f";

                #pri-on-incon-degr "echo o > /proc/sysrq-trigger";
                #pri-lost-after-sb "echo o > /proc/sysrq-trigger";
                #local-io-error "echo o > /proc/sysrq-trigger";
        }

        startup {
                #The init script drbd(8) blocks the boot process until the DRBD resources are connected.  When the  cluster  manager
                #starts later, it does not see a resource with internal split-brain.  In case you want to limit the wait time, do it
                #here.  Default is 0, which means unlimited. The unit is seconds.
                wfc-timeout 0;  # 0 = wait forever

                # Wait for connection timeout if this node was a degraded cluster.
                # In case a degraded cluster (= cluster with only one node left)
                # is rebooted, this timeout value is used.
                #
                degr-wfc-timeout 120;    # 2 minutes.
        }

        syncer {
                rate 100M;
                # This is now expressed with "after res-name"
                #group 1;
                al-extents 257;
        }

        net {
                # TODO: Should these timeouts be relative to some heartbeat settings?
                # timeout       60;    #  6 seconds  (unit = 0.1 seconds)
                # connect-int   10;    # 10 seconds  (unit = 1 second)
                # ping-int      10;    # 10 seconds  (unit = 1 second)

                # if the connection to the peer is lost you have the choice of
                #  "reconnect"   -> Try to reconnect (AKA WFConnection state)
                #  "stand_alone" -> Do not reconnect (AKA StandAlone state)
                #  "freeze_io"   -> Try to reconnect but freeze all IO until
                #                   the connection is established again.
                # FIXME This appears to be obsolete
                #on-disconnect reconnect;

                # FIXME Experimental crap
                #cram-hmac-alg "sha256";
                #shared-secret "secretPassword555";
                #after-sb-0pri discard-younger-primary;
                #after-sb-1pri consensus;
                #after-sb-2pri disconnect;
                #rr-conflict disconnect;
        }

        disk {
                # if the lower level device reports io-error you have the choice of
                #  "pass_on"  ->  Report the io-error to the upper layers.
                #                 Primary   -> report it to the mounted file system.
                #                 Secondary -> ignore it.
                #  "panic"    ->  The node leaves the cluster by doing a kernel panic.
                #  "detach"   ->  The node drops its backing storage device, and
                #                 continues in disk less mode.
                #
                on-io-error   pass_on;

                # Under  fencing  we understand preventive measures to avoid situations where both nodes are
                # primary and disconnected (AKA split brain).
                fencing dont-care;

                # In case you only want to use a fraction of the available space
                # you might use the "size" option here.
                #
                # size 10G;
        }


        on testcluster1 {
                device          /dev/drbd0;
                disk            /dev/sdb1;
                address         10.0.0.1:7788;
                meta-disk       internal;
        }

        on testcluster2 {
                device          /dev/drbd0;
                disk            /dev/sdb1;
                address         10.0.0.2:7788;
                meta-disk       internal;
        }
}

resource "drbd1" {
        # transfer protocol to use.
        # C: write IO is reported as completed, if we know it has
        #    reached _both_ local and remote DISK.
        #    * for critical transactional data.
        # B: write IO is reported as completed, if it has reached
        #    local DISK and remote buffer cache.
        #    * for most cases.
        # A: write IO is reported as completed, if it has reached
        #    local DISK and local tcp send buffer. (see also sndbuf-size)
        #    * for high latency networks
        #
        protocol C;

        handlers {
                # what should be done in case the cluster starts up in
                # degraded mode, but knows it has inconsistent data.
                #pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

                pri-on-incon-degr "echo 'DRBD: primary requested but inconsistent!' | wall; /etc/init.d/heartbeat stop"; #"halt -f";
                pri-lost-after-sb "echo 'DRBD: primary requested but lost!' | wall; /etc/init.d/heartbeat stop"; #"halt -f";

                #pri-on-incon-degr "echo o > /proc/sysrq-trigger";
                #pri-lost-after-sb "echo o > /proc/sysrq-trigger";
                #local-io-error "echo o > /proc/sysrq-trigger";
        }

        startup {
                #The init script drbd(8) blocks the boot process until the DRBD resources are connected.  When the  cluster  manager
                #starts later, it does not see a resource with internal split-brain.  In case you want to limit the wait time, do it
                #here.  Default is 0, which means unlimited. The unit is seconds.
                wfc-timeout 0;  # 0 = wait forever

                # Wait for connection timeout if this node was a degraded cluster.
                # In case a degraded cluster (= cluster with only one node left)
                # is rebooted, this timeout value is used.
                #
                degr-wfc-timeout 120;    # 2 minutes.
        }

        syncer {
                rate 100M;
                # This is now expressed with "after res-name"
                #group 1;
                al-extents 257;
        }

        net {
                # TODO: Should these timeouts be relative to some heartbeat settings?
                # timeout       60;    #  6 seconds  (unit = 0.1 seconds)
                # connect-int   10;    # 10 seconds  (unit = 1 second)
                # ping-int      10;    # 10 seconds  (unit = 1 second)

                # if the connection to the peer is lost you have the choice of
                #  "reconnect"   -> Try to reconnect (AKA WFConnection state)
                #  "stand_alone" -> Do not reconnect (AKA StandAlone state)
                #  "freeze_io"   -> Try to reconnect but freeze all IO until
                #                   the connection is established again.
                # FIXME This appears to be obsolete
                #on-disconnect reconnect;

                # FIXME Experimental crap
                #cram-hmac-alg "sha256";
                #shared-secret "secretPassword555";
                #after-sb-0pri discard-younger-primary;
                #after-sb-1pri consensus;
                #after-sb-2pri disconnect;
                #rr-conflict disconnect;
        }

        disk {
                # if the lower level device reports io-error you have the choice of
                #  "pass_on"  ->  Report the io-error to the upper layers.
                #                 Primary   -> report it to the mounted file system.
                #                 Secondary -> ignore it.
                #  "panic"    ->  The node leaves the cluster by doing a kernel panic.
                #  "detach"   ->  The node drops its backing storage device, and
                #                 continues in disk less mode.
                #
                on-io-error   pass_on;

                # Under  fencing  we understand preventive measures to avoid situations where both nodes are
                # primary and disconnected (AKA split brain).
                fencing dont-care;

                # In case you only want to use a fraction of the available space
                # you might use the "size" option here.
                #
                # size 10G;
        }


        on testcluster1 {
                device          /dev/drbd1;
                disk            /dev/sdb2;
                address         10.0.0.1:7789;
                meta-disk       internal;
        }

        on testcluster2 {
                device          /dev/drbd1;
                disk            /dev/sdb2;
                address         10.0.0.2:7789;
                meta-disk       internal;
        }
}
Don't forget to copy this file to both node locations.
Now it's time to setup DRBD. Run the following commands on both nodes.
testcluster1 / # modprobe drbd
testcluster1 / # drbdadm create-md drbd0
testcluster1 / # drbdadm attach drbd0
testcluster1 / # drbdadm connect drbd0
testcluster1 / # drbdadm create-md drbd1
testcluster1 / # drbdadm attach drbd1

testcluster1 / # drbdadm connect drbd1
testcluster2 / # modprobe drbd
testcluster2 / # drbdadm create-md drbd0
testcluster2 / # drbdadm attach drbd0
testcluster2 / # drbdadm connect drbd0
testcluster2 / # drbdadm create-md drbd1
testcluster2 / # drbdadm attach drbd1

testcluster2 / # drbdadm connect drbd1

Now on the primary node run the following to synchronize both drbd disks:
testcluster1 / # drbdadm -- --overwrite-data-of-peer primary drbd0
testcluster1 / # drbdadm -- --overwrite-data-of-peer primary drbd1
At this point a full synchronization should be occurring. You can monitor the progress with the following command.
testcluster1 / # watch cat /proc/drbd
Code: monitor the synchronization progress
version: 8.0.11 (api:86/proto:86)
GIT-hash: b3fe2bdfd3b9f7c2f923186883eb9e2a0d3a5b1b build by root@localhost, 2008-04-18 11:35:09
 0: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
    ns:2176 nr:0 dw:49792 dr:2196 al:17 bm:0 lo:0 pe:5 ua:0 ap:0
       [>....................] sync'ed:  0.4% (1050136/1052152)K
       finish: 0:43:45 speed: 336 (336) K/sec
       resync: used:0/31 hits:130 misses:1 starving:0 dirty:0 changed:1
       act_log: used:0/257 hits:12431 misses:35 starving:0 dirty:18 changed:17
 1: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
    ns:163304 nr:0 dw:33252 dr:130072 al:13 bm:8 lo:0 pe:5 ua:0 ap:0
       [=>..................] sync'ed: 14.6% (893640/1044156)K
       finish: 0:49:38 speed: 296 (360) K/sec
       resync: used:0/31 hits:13317 misses:16 starving:0 dirty:0 changed:16
       act_log: used:0/257 hits:8300 misses:13 starving:0 dirty:0 changed:13
Depending on your hardware and the size of the partition, this could take some time. Later, when everything is synced, the mirroring will be very fast. See DRBD-Performance for more information.
You can now use /dev/drbd0 and /dev/drbd1 as normal disks, even before syncing has finished. So let's go ahead and format the disks. Use whatever filesystem you want; do this on the first node. In this example we use ext3.
Formatting the disks:
testcluster1 / # mke2fs -j /dev/drbd0
testcluster1 / # mke2fs -j /dev/drbd1
Now setup the primary and secondary nodes. Notice these commands are different for each node.
testcluster1 / # drbdadm primary all
testcluster2 / # drbdadm secondary all
Make sure you add the mount points to the fstab and that they are set to noauto. Again, this needs to be done on both nodes.
File: /etc/fstab
#                                           

# NOTE: If your BOOT partition is ReiserFS, add the notail option to opts.
/dev/sda1               /boot           ext2            noatime         1 2
/dev/sda3               /               ext3            noatime         0 1
/dev/sda2               none            swap            sw              0 0
/dev/cdrom              /mnt/cdrom      auto            noauto,ro       0 0
#/dev/fd0               /mnt/floppy     auto            noauto          0 0
/dev/drbd0              /wwwjail/siteroot       ext3    noauto          0 0
/dev/drbd1              /wwwjail/mysql          ext3    noauto          0 0

proc                    /proc           proc            defaults        0 0

# glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
# POSIX shared memory (shm_open, shm_unlink).
# (tmpfs is a dynamically expandable/shrinkable ramdisk, and will
#  use almost no memory if not populated with files)
shm                     /dev/shm        tmpfs           nodev,nosuid,noexec     0 0
Time to create mount points, both nodes again.
testcluster1 / # mkdir -p /wwwjail/siteroot
testcluster1 / # mkdir -p /wwwjail/mysql
testcluster2 / # mkdir -p /wwwjail/siteroot

testcluster2 / # mkdir -p /wwwjail/mysql
You can mount them on the first node:
testcluster1 / # mount /wwwjail/siteroot
testcluster1 / # mount /wwwjail/mysql
Note: It is possible to mount the drbd partition on both nodes, but is not covered in this article. For more information see [1].
MySQL should already be installed but we need to configure it to use the DRBD device. We do that by simply putting all the databases and logs in /wwwjail/mysql. In a production environment, you'd probably break out logs, database and index files onto different devices. Since this is an experimental system, we'll just put everything into one resource.
Make sure no bind address is set because we need to bind to all interfaces and then limit access with iptables if need be. This needs to go on both nodes.
File: /etc/mysql/my.cnf
...
#bind-address                           = 127.0.0.1
...
#datadir                                        = /var/lib/mysql
datadir                                         = /wwwjail/mysql
...
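If you do want to limit access as mentioned above, a minimal iptables sketch could look like this (the 192.168.0.0/24 subnet is an assumption based on the addresses used in this example):

testcluster1 / # iptables -A INPUT -p tcp --dport 3306 -s 192.168.0.0/24 -j ACCEPT
testcluster1 / # iptables -A INPUT -p tcp --dport 3306 -j DROP
testcluster1 / # /etc/init.d/iptables save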
Now we need to install a mysql database to the shared drive. Issue the following command on both nodes.
testcluster1 / # mysql_install_db
testcluster2 / # mysql_install_db
OK, if everything has gone well to this point, you can add DRBD to the startup items on both nodes by adding it to the default runlevel.
testcluster1 / # rc-update add drbd default
testcluster2 / # rc-update add drbd default
Warning: Do not add drbd to your kernel's auto loaded modules! It *will* cause issues with heartbeat.
Note: It is unclear if you need to wait for the sync to finish at this point, but it might be a good idea anyway.
After syncing (unless you're brave) you should be able to start DRBD normally:
testcluster1 / # /etc/init.d/drbd start
testcluster2 / # /etc/init.d/drbd start
The DRBD service should automatically load the drbd kernel module:
testcluster1 etc # lsmod
Code: kernel modules listing
Module                  Size  Used by
drbd                  142176  1 
You can swap the roles again to verify that it syncs in both directions. If you're fast enough you can issue these commands within a few seconds, but you'll only see that DRBD was faster ;-)
If you reboot during testing, you will have to issue
# drbdadm primary all
on the node where you want the data to be served from, as DRBD does not remember the roles (for DRBD both nodes are equal). We will do this automatically with Heartbeat later.
If you have different data on each node then you may have a split brain. To fix this, run the following commands. Note that this assumes that testcluster1 is more up to date than testcluster2; if the opposite is true, reverse the commands for each node.

testcluster1 / # drbdadm connect all
testcluster2 / # drbdadm -- --discard-my-data connect all

Installing and Configuring Heartbeat

Heartbeat is based on init scripts, so setting it up to do advanced things is not that difficult, but that is not going to be covered in this doc.
Again, most of the commands here need to be run on both nodes. Go ahead and emerge heartbeat.
emerge -av heartbeat
Code: emerging heartbeat
These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild  N    ] sys-cluster/heartbeat-2.0.7-r2  USE="-doc -ldirectord -management -snmp" 3,250 kB

Total: 1 package (1 new), Size of downloads: 3,250 kB

All of the heartbeat config is done in /etc/ha.d/. Again, most of the important config files are not included by default on install, so you will need to create them.
File: /etc/ha.d/ha.cf
# What interfaces to heartbeat over?
#udp  eth1
bcast eth1

# keepalive: how many seconds between heartbeats
keepalive 2

# Time in seconds before issuing a "late heartbeat" warning in the logs.
warntime 10

# Node is pronounced dead after 30 seconds.
deadtime 15

# With some configurations, the network takes some time to start working after a reboot.
# This is a separate "deadtime" to handle that case. It should be at least twice the normal deadtime.
initdead 30

# Mandatory. Hostname of machine in cluster as described by uname -n.
node    testcluster1
node    testcluster2


# When auto_failback is set to on once the master comes back online, it will take
# everything back from the slave.
auto_failback off

# Some default uid, gid info, This is required for ipfail
apiauth default uid=nobody gid=cluster
apiauth ipfail uid=cluster
apiauth ping gid=nobody uid=nobody,cluster

# This is to fail over if the outbound network connection goes down.
respawn cluster /usr/lib/heartbeat/ipfail

# IP to ping to check to see if the external connection is up.
ping 192.168.0.1
deadping 15

debugfile /var/log/ha-debug

# File to write other messages to
logfile /var/log/ha-log

# Facility to use for syslog()/logger
logfacility     local0
The haresources file is probably the most important file to configure. It lists which init scripts need to be run and the parameters to pass to each script. Scripts are looked up in /etc/ha.d/resource.d/ followed by /etc/init.d.
Please note that the init scripts need to follow the Linux Standard Base Core Specification, specifically with respect to the function return codes.
Fix me: Can anyone lookup if this is the correct format for this file?
File: both nodes: /etc/ha.d/haresources
testcluster1 IPaddr::192.168.0.100
testcluster1 drbddisk::drbd0 Filesystem::/dev/drbd0::/wwwjail/siteroot::ext3::noatime apache2
testcluster1 drbddisk::drbd1 Filesystem::/dev/drbd1::/wwwjail/mysql::ext3::noatime mysql
For example, IPaddr::192.168.0.100 will run the /etc/ha.d/resource.d/IPaddr script, which creates an IP alias on eth0 with the IP address 192.168.0.100.
Warning: The contents of the haresources file must be exactly the same on both nodes!
drbddisk will run drbdadm primary drbd0, and Filesystem is basically just a mount.
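In other words, for the first two lines of the haresources file above, a failover to testcluster1 is roughly equivalent to running the following by hand (a sketch, not what Heartbeat literally executes):

testcluster1 / # /etc/ha.d/resource.d/IPaddr 192.168.0.100 start
testcluster1 / # drbdadm primary drbd0
testcluster1 / # mount -t ext3 -o noatime /dev/drbd0 /wwwjail/siteroot
testcluster1 / # /etc/init.d/apache2 start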
The last file tells Heartbeat how to authenticate heartbeat messages between the nodes. Because this example is based on a simulated crossover cable setup connecting the 2 nodes, we will just use a CRC check.
File: both nodes: /etc/ha.d/authkeys
auth 1
1 crc
If you plan on sending the heartbeat across a shared network you should use something a little stronger than CRC. The following is the configuration for sha1.
File: both nodes: /etc/ha.d/authkeys
auth 1
1 sha1 MySharedSecretPassword
Finally, because the /etc/ha.d/authkeys file may contain a plain-text password, set up its permissions on both nodes.
testcluster1 / # chown root:root /etc/ha.d/authkeys
testcluster1 / # chmod 600 /etc/ha.d/authkeys
testcluster2 / # chown root:root /etc/ha.d/authkeys

testcluster2 / # chmod 600 /etc/ha.d/authkeys
