How To Create an iSCSI SAN using Heartbeat, DRBD, and OCFS2

SkyHi @ Friday, January 22, 2010
A client recently needed a Storage Area Network (SAN). We were expanding the system to a small cluster in order to prepare for future growth. For those of you unfamiliar with a cluster environment: when you have multiple servers accessing the same data, you need a SAN in order to protect that data. Well, let me clarify that: when you have multiple servers *writing* data, you need a SAN. If they are only reading data, you can use something like NFS to export a drive.
Let me provide a quick cluster / file system intro here… In a cluster, each server is a ‘node’ of the cluster. In this case, the SAN is a server, and each server, or node, in the cluster is actually a ‘client’: there is one SAN server and multiple nodes accessing the same services/data, hence clients. And when you have multiple clients with write access to the data, you need a ‘Cluster File System’. A cluster file system ensures that files do not get corrupted by two nodes trying to write to the same file at the same time. It manages file locks and write ordering, both of which are critical to protecting data.
A big part of the reason that I was looking for an open-source SAN solution was cost. We needed a reliable, low-cost solution. We really wanted to keep the budget around $10,000 or less.
The solution?
Through some research and creative thinking, I realized we could build a SAN with commodity hardware, in this case, Dell 2950 Servers (2), and a Dell gigabit switch. For the software, we’d use CentOS 5 as the Operating System, DRBD to mirror the data, Heartbeat to manage the cluster and failover, OCFS2 as the filesystem, and iSCSI to export the drive. Let me give a quick overview of these items:

CentOS: A downstream rebuild of a very prominent North American Linux distribution. Virtually the same exact code, without the requirement to buy a support subscription in order to get updates.
DRBD: An amazing piece of software that acts at the block level to replicate data between servers. Basically, this piece of software mirrors all data writes from one server to the other, creating a network RAID 1 (mirror).
Heartbeat: A package that allows two servers to communicate and check if the other is still ‘alive’. If the software detects that one server has died, any services running on the ‘dead’ server will automatically migrate over to the ‘live’ server.
OCFS2: A cluster filesystem developed by Oracle.
iSCSI: A network transport that wraps SCSI commands, enabling a ‘node’ to mount a remote drive as if it were a local drive. iSCSI provides a reliable method for reading/writing data on a drive that lives on a completely separate server, possibly even on a separate network.

The Setup
I installed the software identically on both Dell servers. The two Dell servers were then connected to the gigabit switch (192.168.1.x), which also connects the other nodes to the SAN. I would NOT recommend using a 100Mbps switch, as you’ll likely saturate it; 1Gbps provided me with enough bandwidth. The two SAN servers were also connected with a crossover cable on a separate network (10.0.0.x). This gave DRBD a dedicated 1Gbps line to replicate data between the two servers. The ‘nodes’ in the cluster mount the iSCSI drive over the local network.

Essentially, I followed all of the typical configuration steps for the above software. You’ll need a working installation of CentOS. Nothing too special, though. I prefer to keep my servers slim and install only the minimum, leaving off the graphical interface. The only exception to this is the cluster nodes, because the OCFS2 tools use a GUI to set up and maintain the cluster configuration.
I installed DRBD via yum:
yum install drbd

My DRBD configuration file looks like this:
global {
    usage-count yes;
}

common {
  syncer { rate 100M; }
}

resource drbd0 {

  protocol C;

  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error    "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater";
  }

  startup {
    wfc-timeout      600;
    degr-wfc-timeout 120;    # 2 minutes.
  }

  disk {
    on-io-error   detach;
    fencing       resource-only;
  }

  net {
    cram-hmac-alg "sha1";
    shared-secret "ImNotPostingThatOnTheInternet";
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    rr-conflict   disconnect;
  }

  # The hostname after 'on' must match `uname -n` on each server.
  # san1/san2 and the 10.0.0.x addresses below are placeholders.
  on san1 {
    device     /dev/drbd0;
    disk       /dev/sdb1;
    address    10.0.0.1:7788;
    flexible-meta-disk  internal;
  }

  on san2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.2:7788;
    flexible-meta-disk internal;
  }
}
While the full configuration of DRBD is outside the scope of this post (and LinBit has a great User Guide), there are a few things to point out. First off, I have DRBD communicating on its own private network (the 10.0.0.x crossover link), on port 7788. Also, note that /dev/sdb1 is where my 1TB RAID array is located (as far as CentOS is concerned). This is my shared storage.
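With the config file in place on both servers (it must be identical on each), the typical first-time bring-up, assuming the resource name drbd0 from the config above, goes roughly like this:

```shell
# On BOTH servers: write DRBD's metadata to the backing device
# and start the service.
drbdadm create-md drbd0
service drbd start

# On ONE server only: force it to primary so the initial full
# sync has a defined source.
drbdadm -- --overwrite-data-of-peer primary drbd0

# Watch the sync progress from either server.
cat /proc/drbd
```

Don't put any data on the device until /proc/drbd shows the sync has completed and both sides report UpToDate.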

So, at this point we have DRBD working. Next, let’s get Heartbeat up and running to manage the drive and failover. Again, full configuration of Heartbeat is outside the scope of this posting, but again, great information is available on their web site.
For this set-up, Heartbeat runs in CRM mode, so the resources live in its cib.xml configuration.
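A sketch of what such a resource group looks like in Heartbeat 2's CRM format; the resource IDs and the 192.168.1.10 virtual IP are placeholders, not the values from my setup:

```xml
<group id="group_san">
  <!-- drbddisk promotes the named DRBD resource to primary on the active node -->
  <primitive id="res_drbddisk" class="heartbeat" type="drbddisk" provider="heartbeat">
    <instance_attributes id="res_drbddisk_ia">
      <attributes>
        <nvpair id="res_drbddisk_1" name="1" value="drbd0"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <!-- The floating virtual IP that iSCSI clients connect to -->
  <primitive id="res_ip" class="ocf" type="IPaddr2" provider="heartbeat">
    <instance_attributes id="res_ip_ia">
      <attributes>
        <nvpair id="res_ip_addr" name="ip" value="192.168.1.10"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <!-- The iSCSI target daemon, managed via its init script -->
  <primitive id="res_ietd" class="lsb" type="iscsi-target"/>
</group>
```

Grouping the three resources keeps them together on one node and starts them in order: DRBD primary first, then the IP, then the target daemon.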

The only real things to point out here are that I have a ‘drbddisk’ resource defined and I’ve told Heartbeat to manage it, and that I have defined a virtual IP address. This IP will ‘float’ between the two servers, depending on which server is active. THIS IS THE IP THAT ISCSI WILL RUN ON!!! That’s critical. It allows me to configure the nodes to look for iSCSI drives on that IP and still have the cluster fail over properly. That means, if the primary server fails, the secondary server should come up and essentially tell DRBD ‘you are now primary, take over’. The other piece of Heartbeat is the ha.cf file, which is pretty straightforward:

keepalive 3
deadtime 15
warntime 9
initdead 30
auto_failback off
udpport 694
baud 19200
serial /dev/ttyS0
crm yes
bcast eth0
use_logd yes
respawn hacluster /usr/lib64/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster

Basically, this file defines the cluster members and how the cluster will communicate. In my case, I’m communicating over a serial port AND over ethernet. This is critical: you need more than one communication path. If you only communicate over ethernet and your network has a small hiccup, both nodes will think the other went away and will BOTH try to be primary (a ‘split-brain’), which can be a very bad thing.

OK, so here we are, we have DRBD set up to mirror the data between the two servers and heartbeat set up to manage the failover. Almost there. Now, we need a way to export the device. For this, I’m using iSCSI. 

I chose to use IETD, the iSCSI Enterprise Target. Installation was really quite simple. I downloaded the file, untarred it, and just followed the instructions in the README; basically, just make && make install. Configuration is pretty straightforward, too. There is a default config file that has pretty much everything you need, with just a few things to update, such as your target name. Honestly, the file is commented so well that I’m not going to post it. All I had to do was specify the target name, the authentication username/password (which is so basic that it hardly provides any protection), and the location of the drive to export.

This is the one thing that threw me off for a bit - you MUST EXPORT THE DRBD DRIVE. If you export your physical drive, the data will not be mirrored. So, in my case, this looked like:
Lun 0 Path=/dev/drbd0,Type=fileio
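For reference, a minimal /etc/ietd.conf along those lines; the IQN and the credentials here are made-up placeholders, not the ones from my setup:

```
# One target, exporting the DRBD device (NOT the raw disk).
Target iqn.2010-01.com.example:san.data
    # CHAP credentials the initiators must present (placeholders).
    IncomingUser iscsiuser secretsecret
    Lun 0 Path=/dev/drbd0,Type=fileio
```

The IQN format is reversed-domain plus a date and a label; use something that matches your own domain.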

Now, I’ve got everything I need set up on the SAN servers. Remember, I have 2 servers for the failover, so all of this must be done on both servers. Time for the nodes…

Each node will need OCFS2 and the OCFS2 tools installed. Again, this is beyond the scope of this document. However, Oracle provides RPMs for RHEL, which will install just fine on CentOS. Again, for the initial setup, you need the GUI (X-Windows, GNOME, KDE, etc). The GUI generates a config file for OCFS2. Just follow the instructions provided by Oracle for this step. It’s really pretty simple. You’re basically just naming the nodes and assigning them to a cluster.

The only trick with the nodes is to discover the iSCSI drive and add it to /etc/fstab so that it is found on boot. You’ll need iscsi-initiator-utils for RHEL 5 or CentOS 5. Basically, once you have that installed, you’ll just need to run a discovery against the SAN’s virtual IP (shown here as a placeholder):
iscsiadm -m discovery -t sendtargets -p <virtual-IP>

You can then log in to the target the discovery finds (substitute the target name and the virtual IP it reported):

iscsiadm -m node -T <target-iqn> -p <virtual-IP>:3260,1 -l

You’ll also need a file in /etc/iscsi/ called ‘initiatorname.iscsi’, which holds this node’s initiator name (IQN); the iscsi service reads it when it starts up.
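To have each node log back in to the target automatically at boot, rather than requiring a manual login, you can flag the node record; a sketch, using the same placeholder target and IP as above:

```shell
# Mark the stored node record so the iscsi init script logs in at boot.
iscsiadm -m node -T <target-iqn> -p <virtual-IP>:3260 \
    --op update -n node.startup -v automatic
```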

After you can see the drive, make sure you format it with OCFS2. Again, this is very simple through the GUI that Oracle provides.
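If you prefer the command line to the GUI, the same format step looks roughly like this; the label and the slot count are placeholders (-N sets how many nodes may mount the volume concurrently, so size it for your cluster plus some growth):

```shell
# Run this ONCE, from any single node that can see the iSCSI drive.
mkfs.ocfs2 -N 4 -L data /dev/sdb
```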

At this point, we’re ready to edit the /etc/fstab file. I added:
/dev/sdb                /data                   ocfs2   _netdev         0 0

The _netdev option tells the OS that this drive is mounted over the network, so it must wait for the network to be available before attempting to mount it. By that point, OCFS2 will have started up, so it will be ready to handle the drive. You can see that I’m mounting mine at /data.

Next? Test it out!! Make sure that OCFS2 is set to start on boot on all of the nodes, which it should be by default.
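On CentOS 5, verifying that everything comes up at boot is a matter of chkconfig; a sketch, assuming the init script names shipped by the iscsi-initiator-utils and ocfs2-tools packages:

```shell
# On each node: the iSCSI initiator, the OCFS2 cluster stack (o2cb),
# and the OCFS2 mount service.
chkconfig iscsi on
chkconfig o2cb on
chkconfig ocfs2 on
```

Then reboot a node and confirm /data comes back mounted on its own before calling it done.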

What would I do differently?
I’d buy a better switch, one with more robust support for VLANs. Basically, the Dell switch I bought is a web-managed switch. I’d really recommend a fully-managed switch for this type of set-up. Yes, it’s more expensive, but it will give you much better VLAN control and room to grow.

I didn’t even look into GFS when I set this up. I’d now explore it as an alternative to OCFS2, mainly because Red Hat develops and maintains it, which makes me think that RHEL and CentOS will probably integrate better with it than with OCFS2.

Additional Notes:
Please, please, please do NOT put this kind of setup on the internet. It’s not really that safe. Yes, your web servers will need to be public; that’s what a DMZ is for. Please make sure you have some sort of firewall between the nodes that are public and the SAN servers. If not, your data is at risk…

Lastly, I’ll gladly take comments / questions on this setup. It can be tricky to get right, but it’s usually a small configuration problem. Make sure you test each individual step along the way. Make sure you get DRBD working right, then add in Heartbeat, then add in iSCSI, then add in OCFS. If you try to do it all at once and hope it works, you’ll have a heck of a time trying to figure out where your problem is…