Friday, January 22, 2010

A HA Two-Node Server Side Cluster Using Glusterfs and CentOS

SkyHi @ Friday, January 22, 2010
I came across GlusterFS the other day. On the surface it seemed similar to DRBD, but after closer examination I realized it is completely different. After some reading I came to believe it may offer benefits over DRBD and GNBD, and it seemed extremely straightforward to implement, so I decided to test a two-node cluster and run Apache on it.
Network Setup

The logical network setup is basically a two-node cluster using the server-side replication capability (as opposed to client-side replication). In this fashion the client(s) which mount the exported volumes only need to worry about serving the data through an application, Apache in this case.

The HA will be achieved by using round-robin DNS (RRDNS) from the client to the servers; that is, when the client issues a request to the servers it does so using an FQDN for the cluster, and if one of the server nodes is down it will switch to the other node. When the failed node comes back, the self-healing properties of GlusterFS will ensure that data is replicated from the remaining node.

The nodes will be configured as follows:
    * Node 0: 10.0.0.1 (client)
    * Node 1: 10.0.0.2
    * Node 2: 10.0.0.3
    * Node 1 and node 2 will have /ha exported
    * The client will mount /ha
For my example I only have one client, but this setup is easily extendable to two or more clients as well as two or more servers. My setup is of course nothing new; it is based on several examples at the GlusterFS wiki. I took a couple of them and redid the configuration to suit my test needs. In particular you should look at:
    * High-availability storage using server-side AFR by Daniel Maher
    * Setting up AFR on two servers with server side replication by Brandon Lamb
Prep Work

Glusterfs has some dependencies:
   1. Infiniband support if you use an infiniband network
   2. Berkeley DB support
   3. Extended attributes support for the backend system (exported filesystem) ext3 in our case.
   4. FUSE is the primary requirement for Glusterfs.
I will be using CentOS 5.2; the first three requirements come standard with it, but FUSE does not, so you need to install it.

Install FUSE

FUSE is not available from the standard repositories, so you need to get it from RPMforge.

Follow the instructions to install support for the repository in YUM. Make sure to also configure YUM to use the priorities plug-in; this ensures that the standard repositories are preferred over the RPMforge repositories if you use automatic updates, or if you install an update for a particular package, so nothing on your system gets broken.
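On CentOS 5 this typically boils down to two steps; the rpmforge-release version and URL below are placeholders that change over time, so check the RPMforge site for the current one:

# rpm -Uvh http://packages.sw.be/rpmforge-release/rpmforge-release-0.3.6-1.el5.rf.i386.rpm
# yum -y install yum-priorities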

When you have the repository support installed, issue the following at a command prompt:
# yum -y install fuse fuse-devel
The command above will install FUSE and its libraries plus any other packages it needs.

Install the FUSE Kernel Module

Make sure you have the kernel-devel package for your kernel. At a prompt issue:
# yum info kernel-devel
You should see something like this:

Installed Packages
Name                 : kernel-devel
Arch                 : i686
Version         : 2.6.18
Release         : 92.el5
Size                 : 15 M
Repo                 : installed
Summary         : Development package for building kernel modules to match the kernel.
Description         : This package provides kernel headers and makefiles sufficient to build
                : modules against the kernel package.
If it says “installed” then you are OK; otherwise you need to install it. Issue:
yum -y install kernel-devel-2.6.18-92.el5 kernel-headers-2.6.18-92.el5 dkms-fuse
The command above will install the kernel headers and the source for the FUSE kernel module (dkms-fuse).
Change directories to /usr/src/fuse-2.7.4-1.nodist.rf and issue: ./configure; make install
This will install the fuse.ko kernel module. Finally, run chkconfig fuse on to enable FUSE support at boot, and start the FUSE service with service fuse start.
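In command form, those last two steps are:

# chkconfig fuse on
# service fuse start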

Repeat the above procedure on all 3 nodes (the client and the 2 servers).

Install Glusterfs

Download the latest release (1.3.9 at the time of writing) from the GlusterFS download page. If you are using a 64-bit architecture, get the corresponding RPMs; otherwise you will have to build the RPMs from the SRPM with: rpmbuild --rebuild glusterfs-1.3.9-1.src.rpm. This creates the following in /usr/src/redhat/RPMS/i386:
    * glusterfs-1.3.9-1.i386.rpm
    * glusterfs-devel-1.3.9-1.i386.rpm
    * glusterfs-debuginfo-1.3.9-1.i386.rpm
Copy the files to the other nodes and install the RPMs with rpm -ivh glusterfs*. Verify the installation was successful by issuing:
# glusterfs --version
No errors should be reported.
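Putting the 32-bit build-and-install sequence together, it looks roughly like this (the paths are the ones rpmbuild uses on CentOS 5):

# rpmbuild --rebuild glusterfs-1.3.9-1.src.rpm
# cd /usr/src/redhat/RPMS/i386
# rpm -ivh glusterfs*.rpm
# glusterfs --version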

Round-Robin DNS

A key component of the HA setup is RRDNS. Though it is used only in one instance, it is a critical function - one which helps to ensure that the data can be served continuously even in the event that one of the storage servers becomes inaccessible.

Normally, in a standard configuration, a client will access the servers via their IP addresses. The major drawback of this setup is that if a server becomes inaccessible, the client will be unable to access the data. This can be mitigated by using a hostname rather than addresses to access the servers.

Consider the following:
$ host node1.mycluster.com
node1.mycluster.com has address 10.0.0.2
$ host node2.mycluster.com
node2.mycluster.com has address 10.0.0.3
$ host cluster.mycluster.com
cluster.mycluster.com has address 10.0.0.2
cluster.mycluster.com has address 10.0.0.3
$ dig cluster.mycluster.com | grep -A 2 "ANSWER SECTION"
;; ANSWER SECTION:
cluster.mycluster.com. 3600 IN A 10.0.0.2
cluster.mycluster.com. 3600 IN A 10.0.0.3
So you need to configure the zone file for the mycluster.com zone to serve the corresponding records for all the server nodes and for the FQDN of the cluster. Configuration and setup of such a DNS zone is well documented on the Internet, so it is left as an exercise for the reader; a minimal illustrative fragment is shown below.
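As a rough sketch, the relevant records in a BIND-style zone file for mycluster.com might look something like this (TTLs and the rest of the zone omitted):

node1    IN A 10.0.0.2
node2    IN A 10.0.0.3
; two A records for the same name produce the round-robin behaviour
cluster  IN A 10.0.0.2
cluster  IN A 10.0.0.3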
Configure Glusterfs

Now that all the prep work, RPMs, and RRDNS are in place, we are ready to configure the cluster. The key piece of the setup is the “AFR translator”; this is the mechanism that replicates data across its subvolumes, i.e. between the servers.

The reader is encouraged to visit the Gluster wiki and go over the fundamentals of GlusterFS, in particular the performance options used in this setup (readahead, writeback, cache-size, etc.).

Node 1

The following is the configuration of node 1:
[root@node1 ~]# more /etc/glusterfs/glusterfs-server.vol
# Dataspace on Node1
volume gfs-ds
  type storage/posix
  option directory /ha
end-volume
# posix locks
volume gfs-ds-locks
  type features/posix-locks
  subvolumes gfs-ds
end-volume
# Dataspace on Node2
volume gfs-node2-ds
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.0.0.3           # IP address of node2
  option remote-subvolume gfs-ds-locks
  option transport-timeout 5            # value in seconds
end-volume
# automatic file replication translator for dataspace
volume gfs-ds-afr
  type cluster/afr
  subvolumes gfs-ds-locks gfs-node2-ds
end-volume
# the actual exported volume
volume gfs
  type performance/io-threads
  option thread-count 8
  option cache-size 64MB
  subvolumes gfs-ds-afr
end-volume
volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes gfs
  option auth.ip.gfs-ds-locks.allow 10.0.0.*,127.0.0.1
  option auth.ip.gfs.allow 10.0.0.*,127.0.0.1
end-volume
Node 2

The configuration of node 2:
[root@node2 ~]# more /etc/glusterfs/glusterfs-server.vol
# Dataspace on Node2
volume gfs-ds
  type storage/posix
  option directory /ha
end-volume
# posix locks
volume gfs-ds-locks
  type features/posix-locks
  subvolumes gfs-ds
end-volume
# Dataspace on Node1
volume gfs-node1-ds
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.0.0.2           # IP address of node1
  option remote-subvolume gfs-ds-locks
  option transport-timeout 5            # value in seconds
end-volume
# automatic file replication translator for dataspace
volume gfs-ds-afr
  type cluster/afr
  subvolumes gfs-ds-locks gfs-node1-ds
end-volume
# the actual exported volume
volume gfs
  type performance/io-threads
  option thread-count 8
  option cache-size 64MB
  subvolumes gfs-ds-afr
end-volume
volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes gfs
  option auth.ip.gfs-ds-locks.allow 10.0.0.*,127.0.0.1
  option auth.ip.gfs.allow 10.0.0.*,127.0.0.1
end-volume
Client

Finally the client configuration:
[root@node0 ~]# more /etc/glusterfs/glusterfs-client.vol
# the exported volume to mount
volume cluster
  type protocol/client
  option transport-type tcp/client             # for TCP/IP transport
  option remote-host cluster.mycluster.com     # FQDN of the cluster
  option remote-subvolume gfs                  # exported volume
  option transport-timeout 10                  # value in seconds
end-volume
# performance block for cluster # optional!
volume writeback
  type performance/write-behind
  option aggregate-size 131072
  subvolumes cluster
end-volume
# performance block for cluster # optional!
volume readahead
  type performance/read-ahead
  option page-size 65536
  option page-count 16
  subvolumes writeback
end-volume
Start Gluster on Servers and Clients

On both servers make sure:
    * /ha exists
    * If /ha is a mount point, the file system has been created (in our case ext3)
    * /ha is mounted on both servers
    * The configuration files exist
On the client make sure:
    * The fuse service is started and the kernel module is loaded
    * The client configuration exists
Make sure the FQDN cluster.mycluster.com resolves to both addresses (10.0.0.2, 10.0.0.3). Finally, make sure the clock is synchronized on all three nodes.
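As a rough sketch, the server-side checks could look like this (the device name /dev/sdb1 and the time server are assumptions for illustration; adjust to your hardware):

# mkdir -p /ha
# mkfs.ext3 /dev/sdb1          # only if /ha lives on its own partition
# mount /dev/sdb1 /ha
# host cluster.mycluster.com   # should return 10.0.0.2 and 10.0.0.3
# ntpdate pool.ntp.org         # or keep ntpd running on all nodes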

Start gluster on both servers:
[root@node2 ~]# glusterfsd -f /etc/glusterfs/glusterfs-server.vol
[root@node2 ~]# tail /var/log/glusterfsd.log
Start Gluster on the client and mount /ha:
[root@node0 ~]# glusterfs -f /etc/glusterfs/glusterfs-client.vol /ha
[root@node0 ~]# cd /ha
[root@node0 ~]# ls -l
Test that things are working ok by creating a new file:
[root@node0 ~]# cd /ha
[root@node0 ~]# touch file.txt
[root@node0 ~]# ls
file.txt
Now go to both servers and check whether the file was created under /ha on each. If the file exists on both, congratulations!
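A quick way to check from each server:

[root@node1 ~]# ls -l /ha/file.txt
[root@node2 ~]# ls -l /ha/file.txt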
Configure Apache

The configuration of Apache is identical to the one we did for our two-node GNBD cluster. Copy all the web files to the new document root, which will be /ha, then on the client (node 0) change the Apache configuration to point to the new document root (i.e. /ha).
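On CentOS the relevant part of /etc/httpd/conf/httpd.conf ends up looking roughly like this (a sketch; keep whatever other directives you already have):

DocumentRoot "/ha"
<Directory "/ha">
    Options Indexes FollowSymLinks
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>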

When done you should be able to go to the IP of the client via a web browser and get the Apache default page.
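From any machine that can reach the client, a quick command-line check could be:

$ curl -I http://10.0.0.1/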
Testing Failover

To test failover, simulate a failure of one of the nodes; for our test we will choose node 1. So issue:
[root@node1 ~]# killall glusterfsd
Wait a few seconds and then browse to the IP of node 0. You should still get a response; if you do, congratulations once more! You have a working two-node cluster.

Test recovery by restarting Gluster on node 1. Before you do, create a file on the client:
[root@node0 ~]# echo "testing failure" > /ha/test.txt
The file should appear on node 2, but node 1 will not have it. Start Gluster on node 1; after it starts you will notice that “test.txt” still does not appear under /ha on node 1 even though the cluster is once again up.

Self-healing will eventually synchronize the files from node 2 to node 1 as the new files are accessed through the GlusterFS mount. If you want to force self-healing to happen, use a script that accesses the changed files through the mount, or simply run the following on the client:
[root@node0 ~]# ls -lR /ha
The above will force self-healing; after you do this, “test.txt” will appear on node 1.
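If you would rather touch every changed file explicitly instead of relying on a directory listing, a small sketch like the following, run against the client mount, does the same job:

# find /ha -type f -exec head -c 1 {} \; > /dev/null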
Conclusions

The setup presented here gives you a two-node cluster using server-to-server replication and a client that accesses the data. The setup can easily scale up to two or more servers and two or more clients.

Of course in a production environment you should:

    * Use bonded interfaces on all nodes (servers and clients).
    * Use a dedicated network for cluster communication; see the article by Daniel Maher referenced at the beginning of this post.
    * Use 64-bit servers with dedicated storage for each; this will improve performance.
    * Gluster recommends using client-side replication instead of server-side replication; however, I believe there are advantages to server-side replication, since it frees the client to do nothing other than what a client should do: access data.
    * Release 4.0 has added HA as a translator, so IP addresses can now also be used as elements for failover. With this new translator, having an internal DNS to resolve the FQDN for the cluster becomes a non-issue, since you will be able to use the HA IP address of the cluster instead.
    * Use CARP or Heartbeat to give you additional HA on the client side.

Finally, a setup like this gives you an inexpensive way of creating a cluster without the cost of, for example, a SAN. The Gluster wiki has some examples of people using it in production environments.

http://blog.miguelsarmiento.com/ ... lusterfs-and-centos

REFERENCE
http://www.thismail.org/bbs/viewthread.php?tid=3180