I came across Glusterfs the other day. On the surface it seemed similar to DRBD, but after closer examination I realized it is completely different. After some reading I came to see that it may offer benefits over DRBD and GNBD, and it seemed extremely straightforward to implement, so I decided to test a two-node cluster and run Apache on it.
Network Setup
The logical network setup is basically a two-node cluster using the server-side replication capability (as opposed to client-side replication). In this fashion the client(s) that mount the exported volumes only need to worry about serving the data through an application, Apache in this case.
HA will be achieved by using round-robin DNS (RRDNS) from the client to the servers: when the client issues a request to the servers it does so using an FQDN for the cluster, and if one of the server nodes is down it switches to the other node. When the failed node comes back, the self-healing properties of Glusterfs ensure that data is replicated from the remaining node.
The nodes will be configured as follows:
* Node 0: 10.0.0.1 (client)
* Node 1: 10.0.0.2
* Node 2: 10.0.0.3
* Node 1 and node 2 will have /ha exported
* The client will mount /ha
For my example I only have one client, but this setup is easily extendable to two or more clients as well as two or more servers. My setup is of course nothing new; it is based on several examples at the Glusterfs-Wiki. I took a couple of them and redid the configuration to suit my test needs. In particular you should look at:
* High-availability storage using server-side AFR by Daniel Maher
* Setting up AFR on two servers with server side replication by Brandon Lamb
Prep Work
Glusterfs has some dependencies:
1. Infiniband support if you use an Infiniband network
2. Berkeley DB support
3. Extended attributes support for the backend file system (the exported file system), ext3 in our case
4. FUSE, which is the primary requirement for Glusterfs
I will be using CentOS 5.2. The first three requirements come standard with it; FUSE does not, so you need to install it. A quick way to confirm extended attribute support is sketched below.
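As a sanity check (my own addition, not part of the original guides), you can confirm that the backend file system honors extended attributes, which Glusterfs relies on. The setfattr/getfattr tools come from the attr package; run this as root against a file on the file system you will export (shown here under /ha, assuming it already exists):
# yum -y install attr                                            # provides setfattr/getfattr
# touch /ha/xattr-test
# setfattr -n trusted.glusterfs.test -v working /ha/xattr-test   # set a trusted.* extended attribute
# getfattr -d -m . /ha/xattr-test                                # should list the attribute just set
# rm -f /ha/xattr-test
If setfattr fails with "Operation not supported", the backend file system is not suitable as-is.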
Install FUSE
Fuse is not available from the standard repository so you need to get it from RPMFORGE.
Follow the RPMFORGE instructions to install support for the repository in YUM. Make sure to also configure YUM to use the priorities plug-in; this ensures that the standard repositories are preferred over the RPMFORGE repositories if you use automatic updates or if you want to install an update for a particular package without breaking anything on your system.
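For reference, on CentOS 5 the repository setup usually boils down to something like the following; the rpmforge-release file name is a placeholder (grab the current package from the RPMFORGE site), so treat this as a hedged sketch rather than the canonical procedure:
# rpm -Uvh rpmforge-release-<version>.el5.rf.i386.rpm   # download the release package from the RPMFORGE site first
# yum -y install yum-priorities                         # the priorities plug-in
# Then add a "priority=N" line to each repo file in /etc/yum.repos.d/
# (lower numbers win; e.g. priority=1 for CentOS-Base.repo, priority=10 for rpmforge.repo)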
When you have the repository support installed, issue the following at a command prompt:
# yum -y install fuse fuse-devel
The command above will install fuse and its libraries plus any other packages needed by it.
Install the FUSE Kernel Module
Make sure you have the kernel-devel package for your kernel. At a prompt issue:
# yum info kernel-devel
You should see something like this:
Installed Packages
Name : kernel-devel
Arch : i686
Version : 2.6.18
Release : 92.el5
Size : 15 M
Repo : installed
Summary : Development package for building kernel modules to match the kernel.
Description : This package provides kernel headers and makefiles sufficient to build
            : modules against the kernel package.
If it says "installed" then you are ok; otherwise you need to install it. Issue:
# yum -y install kernel-devel-2.6.18-92.el5 kernel-headers-2.6.18-92.el5 dkms-fuse
The command above will install the kernel development files and the source for the FUSE kernel module. Change directories to /usr/src/fuse-2.7.4-1.nodist.rf and issue:
# ./configure; make install
This will install the fuse.ko kernel module. Finally, do a chkconfig fuse on to enable fuse support at boot up, and start the fuse service: service fuse start.
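As a quick check (my addition, not in the original write-up), you can confirm that the module actually loaded and the device node exists:
# service fuse start         # as above; loads fuse.ko and creates /dev/fuse
# lsmod | grep fuse          # the fuse module should be listed
# ls -l /dev/fuse            # the device node should exist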
Repeat the above procedure on all 3 nodes (the client and the 2 servers).
Install Glusterfs
Download the latest release here (1.3.9). If you are using a 64-bit architecture get the corresponding RPMs; otherwise you will have to build the RPMs from the SRPM using the following command: rpmbuild --rebuild glusterfs-1.3.9-1.src.rpm. This will create the following in /usr/src/redhat/RPMS/i386:
* glusterfs-1.3.9-1.i386.rpm
* glusterfs-devel-1.3.9-1.i386.rpm
* glusterfs-debuginfo-1.3.9-1.i386.rpm
Copy the files to the other nodes and install the RPMs with: rpm -ivh glusterfs*. Verify the installation was successful by issuing:
# glusterfs --version
No errors should be reported.
Round-Robin DNS
A key component of the HA setup is RRDNS. Though it is used only in one instance, it is a critical function - one which helps to ensure that the data can be served continuously even in the event that one of the storage servers becomes inaccessible.
Normally, in a standard configuration, a client will access the servers via their IP addresses. The major drawback of this setup is that if the server becomes inaccessible the client will be unable to access the data. This can be mitigated by using a hostname rather than addresses to access the servers.
Consider the following:
$ host node1.mycluster.com
node1.mycluster.com has address 10.0.0.2
$ host node2.mycluster.com
node2.mycluster.com has address 10.0.0.3
$ host cluster.mycluster.com
cluster.mycluster.com has address 10.0.0.2
cluster.mycluster.com has address 10.0.0.3
$ dig cluster.mycluster.com | grep -A 2 "ANSWER SECTION"
;; ANSWER SECTION:
cluster.mycluster.com. 3600 IN A 10.0.0.2
cluster.mycluster.com. 3600 IN A 10.0.0.3
So you need to configure the zone file for the mycluster.com zone to serve the corresponding records for all the server nodes and for the FQDN of the cluster. Configuration and setup of such a DNS zone is well documented on the Internet, so it is largely left as an exercise for the reader; a minimal sketch follows.
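For illustration only, the relevant records in a BIND zone file for mycluster.com might look like the excerpt below. The SOA/NS boilerplate, name server, and serial are placeholders of my own, not taken from the original post:
$TTL 3600
@        IN  SOA  ns1.mycluster.com. admin.mycluster.com. ( 2009010101 3600 900 604800 3600 )
         IN  NS   ns1.mycluster.com.
node1    IN  A    10.0.0.2
node2    IN  A    10.0.0.3
; two A records for the same name give round-robin resolution
cluster  IN  A    10.0.0.2
cluster  IN  A    10.0.0.3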
Configure Glusterfs
Now that all the prep work, RPMs and RRDNS are in place, we are ready to configure the cluster. The key piece of the setup is the “AFR translator”; this is the mechanism that ensures data (its “subvolumes”) is replicated between servers.
The reader is encouraged to visit the Gluster-Wiki and go over the fundamentals of Glusterfs and in particular over the performance options used in this setup (readahead, writeback, cache-size, etc).
Node 1
The following is the configuration of node 1:
[root@node1 ~]# more /etc/glusterfs/glusterfs-server.vol
# Dataspace on Node1
volume gfs-ds
  type storage/posix
  option directory /ha
end-volume

# posix locks
volume gfs-ds-locks
  type features/posix-locks
  subvolumes gfs-ds
end-volume

# Dataspace on Node2
volume gfs-node2-ds
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.0.0.3          # IP address of node2
  option remote-subvolume gfs-ds-locks
  option transport-timeout 5           # value in seconds
end-volume

# automatic file replication translator for dataspace
volume gfs-ds-afr
  type cluster/afr
  subvolumes gfs-ds-locks gfs-node2-ds
end-volume

# the actual exported volume
volume gfs
  type performance/io-threads
  option thread-count 8
  option cache-size 64MB
  subvolumes gfs-ds-afr
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes gfs
  option auth.ip.gfs-ds-locks.allow 10.0.0.*,127.0.0.1
  option auth.ip.gfs.allow 10.0.0.*,127.0.0.1
end-volume
Node 2
The configuration of node 2:
[root@node2 ~]# more /etc/glusterfs/glusterfs-server.vol
# Dataspace on Node2
volume gfs-ds
  type storage/posix
  option directory /ha
end-volume

# posix locks
volume gfs-ds-locks
  type features/posix-locks
  subvolumes gfs-ds
end-volume

# Dataspace on Node1
volume gfs-node1-ds
  type protocol/client
  option transport-type tcp/client
  option remote-host 10.0.0.2          # IP address of node1
  option remote-subvolume gfs-ds-locks
  option transport-timeout 5           # value in seconds
end-volume

# automatic file replication translator for dataspace
volume gfs-ds-afr
  type cluster/afr
  subvolumes gfs-ds-locks gfs-node1-ds
end-volume

# the actual exported volume
volume gfs
  type performance/io-threads
  option thread-count 8
  option cache-size 64MB
  subvolumes gfs-ds-afr
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes gfs
  option auth.ip.gfs-ds-locks.allow 10.0.0.*,127.0.0.1
  option auth.ip.gfs.allow 10.0.0.*,127.0.0.1
end-volume
Client
Finally, the client configuration:
[root@node0 ~]# more /etc/glusterfs/glusterfs-client.vol
# the exported volume to mount
volume cluster
  type protocol/client
  option transport-type tcp/client             # for TCP/IP transport
  option remote-host cluster.mycluster.com     # FQDN of the cluster (RRDNS)
  option remote-subvolume gfs                  # exported volume
  option transport-timeout 10                  # value in seconds
end-volume

# performance block for cluster (optional)
volume writeback
  type performance/write-behind
  option aggregate-size 131072
  subvolumes cluster
end-volume

# performance block for cluster (optional)
volume readahead
  type performance/read-ahead
  option page-size 65536
  option page-count 16
  subvolumes writeback
end-volume
Start Gluster on Servers and Clients
On both servers make sure:
* /ha exists
* If /ha is a mount point, the file system has been created (in our case ext3)
* /ha is mounted on both servers
* The configuration files exist
* The fuse service is started and the kernel module is loaded
On the client make sure:
* The client configuration exists
Make sure the FQDN cluster.mycluster.com resolves to both addresses (10.0.0.2, 10.0.0.3). Finally, make sure the clock is synchronized on all three nodes. A sketch of this preparation is shown below.
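As an illustration only (the device name /dev/sdb1 and the NTP server are placeholders of mine, not from the original post), preparing the backend store and the clocks on each server might look like this:
# mkfs.ext3 /dev/sdb1          # create the ext3 file system on the dedicated partition
# mkdir -p /ha
# mount /dev/sdb1 /ha          # add a matching line to /etc/fstab so it survives reboots
# yum -y install ntp
# ntpdate pool.ntp.org         # one-off clock sync; run ntpd for ongoing synchronization
# service ntpd start && chkconfig ntpd on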
Start gluster on both servers (shown here on node 2; repeat on node 1):
[root@node2 ~]# glusterfsd -f /etc/glusterfs/glusterfs-server.vol
[root@node2 ~]# tail /var/log/glusterfsd.log
Start Gluster on the client and mount /ha:
[root@node0 ~]# glusterfs -f /etc/glusterfs/glusterfs-client.vol /ha
[root@node0 ~]# cd /ha
[root@node0 ~]# ls -l
Test that things are working ok by creating a new file:
[root@node0 ~]# cd /ha
[root@node0 ~]# touch file.txt
[root@node0 ~]# ls
file.txt
Now go to both servers and see if the file was created under /ha on each of them. If the file exists on both, congratulations!
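The guide starts the daemons by hand; if you want them to come back after a reboot, one simple (hedged) approach is to append the same commands to /etc/rc.local on each node. Paths may differ, so confirm them with "which glusterfsd" and "which glusterfs":
# On node1 and node2 (the servers):
echo "/usr/sbin/glusterfsd -f /etc/glusterfs/glusterfs-server.vol" >> /etc/rc.local
# On node0 (the client):
echo "/usr/sbin/glusterfs -f /etc/glusterfs/glusterfs-client.vol /ha" >> /etc/rc.local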
Configure Apache
The configuration of Apache is identical to the one we did for our two-node GNBD cluster. Copy all the files to the new root, which will be /ha, then on the client (node 0) change the Apache configuration to point to the correct document root of the web server (i.e. /ha).
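In practice that change amounts to something like the following excerpt from /etc/httpd/conf/httpd.conf (a hedged sketch; the directives beyond DocumentRoot depend on your existing Apache setup):
DocumentRoot "/ha"
<Directory "/ha">
    Options Indexes FollowSymLinks
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>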
When done you should be able to go to the IP of the client via a web browser and get the Apache default page.
Testing Failover
To test failover, simulate a failure of one of the nodes; for our test we will choose node 1. So issue:
[root@node1 ~]# killall glusterfsd
Wait a few seconds and then browse to the IP of node 0. If you get a response, congratulations once more! You have a working two-node cluster.
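If you want to watch availability while you kill and restart the server, a simple loop against the client works from any machine that can reach it (my own hedged addition; adjust the IP to your client):
while true; do
    # prints the HTTP status code once per second; 200 means the site is still being served
    curl -s -o /dev/null -w "%{http_code}\n" http://10.0.0.1/
    sleep 1
done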
Test recovery by restarting gluster on node 1. Before you do, create a file on the client:
[root@node0 ~]# echo "testing failure" > /ha/test.txt
The file should appear on node 2, but node 1 should not have it. Start gluster on node 1; after it starts you will notice that the file “test.txt” still does not appear in /ha on node 1 even though the cluster is once again up.
Self-healing will eventually synchronize the new files from node 2 to node 1 as they are accessed on node 2. If you want to force self-healing to happen, use a script that accesses the files that have changed, or simply do a:
[root@node2 ~]# ls -lR
The above will force self-healing; after you do this, “test.txt” will appear on node 1.
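An equivalent way to walk the whole tree (a hedged variant of the ls -lR trick above, my own addition) is:
# stat every entry under /ha to trigger a lookup, and thus self-heal, on each one
find /ha -exec stat {} \; > /dev/null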
Conclusions
The setup presented here will give you a two-node cluster using server-to-server replication and a client that will access the data. The setup can easily scale up to two or more servers and two or more clients.
Of course in a production environment you should:
* Use bonded interfaces on all nodes (servers and clients).
* Use a dedicated network for cluster communication; see the article by Daniel Maher mentioned at the beginning of this post.
* Use 64-bit servers with dedicated storage for each; this will improve performance.
* Gluster recommends using client-side replication instead of server-side replication; however, I believe there are advantages to using server-side replication and freeing the client to do nothing other than what a client does: access data.
* Release 4.0 has added HA as a translator, so IP addresses can now also be used as elements for failover. With this new translator, having an internal DNS to resolve the FQDN for the cluster becomes a non-issue, since you will be able to use the HA IP address of the cluster instead.
* Use CARP or Heartbeat to give you additional HA on the client side.
Finally, a setup like this gives you an inexpensive way of creating a cluster without the cost of, for example, a SAN. The Gluster-Wiki has some examples of people using it in production environments.
http://blog.miguelsarmiento.com/ ... lusterfs-and-centos
REFERENCE
http://www.thismail.org/bbs/viewthread.php?tid=3180