Friday, January 22, 2010

High Availability without Expensive Hardware

SkyHi @ Friday, January 22, 2010

The Problem:
A customer called us in a state of panic, telling us that the MySQL server hosting the database for their application had suffered a hardware failure and that the data was not recoverable. We were asked to get them back online as soon as possible and to re-architect their environment so that such a disaster could not happen again.
The Goal:
Engineer a cost-effective solution that provides a highly available web server and database server architecture without using a load balancer.
The Solution:
A highly available web and database architecture built from open source software available for free on the Internet. The solution consists of DRBD (Distributed Replicated Block Device), which replicates block devices over the network; Heartbeat, which provides death-of-node detection; and a MySQL Master-Slave configuration, which replicates databases to another host.

Overview of the Solution:
This solution tolerates the failure of one web server and/or one database server without interruption of service, using DRBD, Heartbeat, and a MySQL Master-Slave configuration. DRBD and Heartbeat allow you to configure one of the web servers to be in standby mode. No HTTP requests go to the Standby server; its purpose is to be available to take the Active server’s place if the Active server fails. The Standby server has an identical system configuration, an identical Apache configuration, and the same “/var/www” directory holding the application files. The application files are kept identical by DRBD, which runs on both servers and replicates “/var/www” to the Standby server. Heartbeat also runs on both servers and actively monitors whether the Active and the Standby servers are online. Whenever Heartbeat finds the Active server unreachable, it initiates the failover process and promotes the Standby server to Active. During failover the Standby server assumes the IP address that the Active server uses for HTTP traffic, mounts “/var/www/”, and starts Apache. Although DRBD and Heartbeat are separate pieces of software, their configuration files are intertwined, and the two work together to make the failover process possible.
Up to this point we have been talking about highly available web servers, but what about making the database servers highly available? This is where Master-Slave replication comes in. By default, the currently active web server connects to the Master database server to execute queries. Whenever changes are made to the databases on the Master, they are replicated to the Slave almost immediately. In the event that the Master fails, the application’s logic automatically routes all queries to the Slave, making the failure transparent.
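
The application-level re-routing described above can be sketched roughly as follows. This is only a minimal illustration in Python, not the customer’s actual code: the function `run_with_failover`, the exception `AllServersDown`, and the two stand-in query functions are hypothetical names, and a real application would pass in functions that open MySQL connections.

```python
from __future__ import annotations

from typing import Callable, Sequence


class AllServersDown(Exception):
    """Raised when no server in the list accepted the query."""


def run_with_failover(
    servers: Sequence[tuple[str, Callable[[str], object]]],
    query: str,
) -> tuple[str, object]:
    """Try each (name, execute) pair in order; return the first success.

    `servers` is ordered by preference, e.g. Master first, then Slave.
    Only ConnectionError is treated as "server unreachable"; any other
    error (such as a bad query) is allowed to propagate.
    """
    errors = []
    for name, execute in servers:
        try:
            return name, execute(query)
        except ConnectionError as exc:
            errors.append(f"{name}: {exc}")
    raise AllServersDown("; ".join(errors))


# Hypothetical stand-ins for real database connections.
def master_down(query: str) -> str:
    raise ConnectionError("master unreachable")


def slave_ok(query: str) -> str:
    return f"ran {query!r} on slave"


if __name__ == "__main__":
    used, result = run_with_failover(
        [("master", master_down), ("slave", slave_ok)], "SELECT 1"
    )
    print(used, result)
```

Note that once writes start going to the Slave, the old Master no longer has the latest data, so it has to be resynchronized before it takes traffic again.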
You may have noticed after looking at the diagram that the Slave database server and the Standby web server are on the same physical machine. This was done to eliminate the cost of adding a dedicated Slave database server and to use the same hardware for two purposes. If you are able to allocate an additional server for this solution, I would recommend a dedicated database Slave server.
How DRBD and Heartbeat work under the hood:
DRBD runs on both web servers and replicates the underlying file system that is mounted on “/var/www”, block by block, over the network. This replication happens over a dedicated crossover cable connecting the “eth1” interfaces of the two servers. The dedicated link provides a high-performance connection between the systems and bypasses network hardware such as switches and routers, which can fail. During normal operation “/var/www” is mounted only on the Active server. This means that data is written only on the Active web server and then replicated to the Standby.
Heartbeat runs on both web servers to monitor whether both nodes are online. The Heartbeat signal is sent over the same crossover cable connecting the “eth1” interfaces of the two servers. The direct connection keeps latency low and avoids the network timeouts that could otherwise trigger a false failover when the signal fails to reach one of the nodes.
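
The death-of-node detection described above boils down to a timeout: if no heartbeat has arrived from the peer within a configured window, the peer is declared dead. The following Python sketch illustrates the idea only; it is not Heartbeat’s implementation. The class `PeerMonitor` is a hypothetical name, and its `deadtime` parameter is modeled loosely on the `deadtime` setting in Heartbeat’s configuration. Timestamps are passed in explicitly so the behavior is easy to follow.

```python
class PeerMonitor:
    """Tracks when the peer node was last heard from."""

    def __init__(self, deadtime: float, started_at: float = 0.0):
        # deadtime: seconds of silence after which the peer counts as dead.
        self.deadtime = deadtime
        # Until the first heartbeat arrives, measure silence from startup.
        self.last_seen = started_at

    def beat(self, now: float) -> None:
        """Record a heartbeat received from the peer at time `now`."""
        self.last_seen = now

    def peer_is_dead(self, now: float) -> bool:
        """True once the peer has been silent for longer than deadtime."""
        return now - self.last_seen > self.deadtime


if __name__ == "__main__":
    monitor = PeerMonitor(deadtime=10.0)
    monitor.beat(now=2.0)
    for t in (5.0, 11.0, 12.5):
        print(t, monitor.peer_is_dead(now=t))
```

A shorter window detects failures faster but makes a false failover more likely when a heartbeat is merely delayed, which is why a low-latency dedicated link matters here.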
The Active and Standby servers each have one primary interface, “eth0”, which is bound to an IP address of “” or” and is used for connecting to the servers over SSH; it can be thought of as the system’s administration interface. The Active server also has a virtual interface “eth0:1” which is bound to an IP address of “” and is used for HTTP traffic. In most situations this IP address is translated to a public IP on the Internet using NAT and is associated with a domain name. The magic of Heartbeat takes place when the Active node becomes unreachable: the Standby node brings up the “eth0:1” interface and binds it to “”, mounts “/var/www/”, and starts Apache. Once this sequence of events is complete, the Standby server has taken over the role of Active. The process takes seconds and is usually invisible to users. After the failover, Heartbeat keeps monitoring the original Active server, and it can be set up to fail back automatically to the original state when that server comes back online.
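
To make the order of the takeover steps concrete, here is a small Python sketch that lists roughly equivalent manual commands without running them. Heartbeat performs these steps through its own resource scripts, so this is an illustration and not what Heartbeat executes. Everything specific is an assumption: the function name `takeover_commands`, the DRBD resource name `r0`, the device `/dev/drbd0`, the `/24` prefix, and the placeholder address `192.0.2.10`, which comes from the documentation-only range and is not an address from this setup. The step that promotes DRBD to primary is not spelled out in the text above; it is included because a DRBD device has to be primary before it can be mounted.

```python
def takeover_commands(
    service_ip: str,
    prefix_len: int = 24,        # assumed netmask
    resource: str = "r0",        # assumed DRBD resource name
    device: str = "/dev/drbd0",  # assumed DRBD device
    mount_point: str = "/var/www",
) -> list:
    """Return the takeover steps, in order, as argument lists (dry run)."""
    return [
        # 1. Bring up the HTTP address on the virtual interface eth0:1.
        ["ip", "addr", "add", f"{service_ip}/{prefix_len}",
         "dev", "eth0", "label", "eth0:1"],
        # 2. Promote the local DRBD resource so the device is writable.
        ["drbdadm", "primary", resource],
        # 3. Mount the replicated file system.
        ["mount", device, mount_point],
        # 4. Start the web server.
        ["apachectl", "start"],
    ]


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder, not a real address from this setup.
    for cmd in takeover_commands("192.0.2.10"):
        print(" ".join(cmd))
```

The order matters: Apache is started last so that it never serves requests before “/var/www” is mounted.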

How MySQL Master-Slave works under the hood:

Replication enables data from a Master database server to be replicated to a Slave database server. The Master records the queries that modify data in a binary log file stored on its file system. The Slave connects to the Master, retrieves the binary log, and replays the “Write” queries to keep itself in sync. Depending on the type of queries, this process can be very fast. It provides a working copy of the database on another server at all times, which can be used if the Master server fails. In this particular solution the application has logic that automatically re-routes queries to the working server if one fails, providing high availability at the database layer.
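
The log-and-replay mechanism can be illustrated with a toy model in Python. This is a deliberately simplified sketch, not how MySQL is implemented: the classes `Master` and `Slave`, the key/value “database”, and the `catch_up` method are all hypothetical. It shows the two ideas that matter: the Master appends every write to a log, and the Slave remembers how far into that log it has replayed.

```python
class Master:
    """Applies writes locally and appends each one to a binary log."""

    def __init__(self):
        self.data = {}
        self.binlog = []  # ordered list of (key, value) write events

    def write(self, key, value):
        self.data[key] = value
        self.binlog.append((key, value))


class Slave:
    """Replays the Master's log from its last known position."""

    def __init__(self):
        self.data = {}
        self.position = 0  # number of log events already applied

    def catch_up(self, master: Master) -> int:
        """Fetch and replay unseen events; return how many were applied."""
        events = master.binlog[self.position:]
        for key, value in events:
            self.data[key] = value
        self.position += len(events)
        return len(events)


if __name__ == "__main__":
    master, slave = Master(), Slave()
    master.write("user:1", "alice")
    master.write("user:2", "bob")
    print(slave.catch_up(master), slave.data == master.data)
```

Between two calls to `catch_up` the Slave lags behind the Master, which mirrors the short delay a real Slave can have before it has replayed the latest writes.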