Command Center: Replicating Content Between USA, Japan (Asia) and UK (Europe) Webservers

Friday, November 12, 2010


Category: rsync, scripts — SkyHi @ Friday, November 12, 2010

We have a corporate intranet for our web sites: us.example.com, jp.example.com (asia.example.com) and uk.example.com (eu.example.com). How do I replicate static content stored at /var/www/corporate_lan/, such as JavaScript files, CSS files, and images, between our USA, Japan and UK web servers running Apache under UNIX, CentOS or Red Hat Enterprise Linux?

Various solutions exist to replicate static files and dynamic web sites across the globe. Replicating a set of static files is pretty easy.

Sample Setup

                                                  eth1:67.1.2.3
                                                +-----------------+
                                                | us.example.com  |
 eth1:202.54.1.2                                +-----------------+
+----------------------+                        |
|  content.example.com |------------------------+
+----------------------+       VPN/intranet     | eth1:87.1.2.3
   /                                            +-----------------+
   /                                            | uk.example.com  |
   |                                            +-----------------+
   +/var/www/corporate_lan/                     |
                          /css/                 | eth1:123.1.2.3
                          /images/              +-----------------+
                          /js/                  | jp.example.com  |
                          /php_cgi/             +-----------------+
                          /perl_cgi/
                          /java_app/
                          /python_app1/

Where,

All servers run the same version of UNIX or Linux and Apache.
DocumentRoot is the same on all servers.
content.example.com - Your main file server. Upload or update all static files here only; do not upload or create files on the other servers. You can then push updates or mirror directories from this server to the rest of the nodes.
us.example.com - Your USA based web server. This server syncs to (or mirrors directories from) the upstream server content.example.com.
jp.example.com - Your Japan based web server. This server syncs to (or mirrors directories from) the upstream server content.example.com.
uk.example.com - Your UK based web server. This server syncs to (or mirrors directories from) the upstream server content.example.com.
All offices are connected using a secure VPN or an intranet - a private computer network that uses Internet Protocol technologies to securely share any part of an organization's network operating system.

Solution # 1: Mirroring Using rsync

You can use the rsync application to synchronize files and directories from content.example.com to other locations such as us.example.com while minimizing data transfer using delta encoding when appropriate. rsync can copy directory contents and files using compression and recursion. You must install rsync on all servers. Type the following commands on content.example.com to replicate /var/www/corporate_lan/ to all three servers:

 
rsync -av /var/www/corporate_lan root@us.example.com:/var/www/
rsync -av /var/www/corporate_lan root@uk.example.com:/var/www/
rsync -av /var/www/corporate_lan root@jp.example.com:/var/www/

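To replicate only the /var/www/corporate_lan/css directory, enter the following. Note that the destination is /var/www/corporate_lan/, so that css lands in the same place on the remote server rather than in /var/www/css:

 
rsync -av /var/www/corporate_lan/css root@us.example.com:/var/www/corporate_lan/
rsync -av /var/www/corporate_lan/css root@uk.example.com:/var/www/corporate_lan/
rsync -av /var/www/corporate_lan/css root@jp.example.com:/var/www/corporate_lan/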

--delete option

By default rsync does not remove files on the destination that no longer exist in /var/www/corporate_lan on the source. Suppose you delete a file on content.example.com:
# rm /var/www/corporate_lan/images/new_logo.png
To remove deleted files from all the other servers as well, i.e. to keep an exact mirror of content.example.com, enter:

 
rsync -av --delete /var/www/corporate_lan root@us.example.com:/var/www/
rsync -av --delete /var/www/corporate_lan root@uk.example.com:/var/www/
rsync -av --delete /var/www/corporate_lan root@jp.example.com:/var/www/
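Since --delete removes files on the remote side, it can be worth previewing a run first. The following is a minimal sketch, assuming the same hosts and paths as above; preview_cmd is a hypothetical helper name, and it only prints the commands rather than running them. The -n (--dry-run) option makes rsync report what it would transfer or delete without changing anything.

```bash
#!/bin/bash
# Hypothetical helper (not from the original article): print the dry-run
# form of the mirror command for one host. Nothing is executed here.
preview_cmd() {
    local host="$1"
    # -n is rsync's --dry-run: show what would change, change nothing
    printf 'rsync -avn --delete %s root@%s:%s\n' \
        "/var/www/corporate_lan" "$host" "/var/www/"
}

# Print one preview command per web server
for h in us.example.com uk.example.com jp.example.com; do
    preview_cmd "$h"
done
```

Once the dry-run output looks right, drop the n and run the command for real.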

The -a (archive) option is shorthand for -rlptgoD and works as follows:

Recurse into directories
Copy symlinks as symlinks
Preserve file permissions
Preserve modification times
Preserve group (so make sure you use the same group names on all servers)
Preserve owner (you need to run rsync as root on the receiving side, and should use the same usernames on all servers)
Preserve device files and special files

You can compress file data during the transfer using the -z or --compress option:

 
rsync -z -av --delete /var/www/corporate_lan root@us.example.com:/var/www/

The --compress-level=NUM option explicitly sets the compression level:

rsync -z --compress-level=5 -av --delete /var/www/corporate_lan root@us.example.com:/var/www/

Excluding files

You can exclude files as follows:

rsync -z --compress-level=5 -av --delete --exclude='cache/*' --exclude='*~'  /var/www/corporate_lan root@us.example.com:/var/www/

You can also put the patterns in a file (for example /root/mirror.exclude), one per line:

cache/*
/dev/
/.conf/
*~

The --exclude-from=/root/mirror.exclude option reads exclude patterns from /root/mirror.exclude:

rsync -z --compress-level=5 -av --delete --exclude-from=/root/mirror.exclude  /var/www/corporate_lan root@us.example.com:/var/www/

Sample rsync server mirroring shell script

You can create a shell script (say /root/mirror.dirs) that mirrors the directories and files, and run it every 30 minutes or as per your requirements:

#!/bin/bash
# Usage: Mirror directories and files to our US, UK and Japan based servers.
# --------------------------------------------------------------------------
_upstream="/var/www/corporate_lan"
_servers="root@us.example.com:/var/www/ root@uk.example.com:/var/www/ root@jp.example.com:/var/www/"
_rsync="/usr/bin/rsync"
_exclude="/root/mirror.exclude"
_log="/var/log/rsync_mirror.log"
_opts=""
# Use the exclude file only if it exists
[ -f "${_exclude}" ] && _opts="--exclude-from=${_exclude}"
for e in $_servers
do
        # $_opts is left unquoted on purpose so an empty value expands to nothing
        $_rsync -z -a --delete $_opts "$_upstream" "$e"
done &> "$_log"

Run it once an hour using cron, i.e. mirror the servers once an hour:

@hourly /root/mirror.dirs
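The script above writes rsync's output to a log but does not look at rsync's exit status, so a server that was unreachable during a cron run is easy to miss. The following is a minimal sketch of one way to count failures; mirror_all and sync_one are hypothetical names, not part of the original script, and the rsync options are simply copied from it.

```bash
#!/bin/bash
# Hypothetical sketch: run a sync command for each destination and
# report how many destinations failed.

# Sync the upstream tree to one destination (same options as the script above)
sync_one() {
    rsync -z -a --delete /var/www/corporate_lan "$1"
}

# mirror_all SYNC_COMMAND DEST...
# Calls SYNC_COMMAND once per DEST; returns the number of failed destinations.
mirror_all() {
    local sync_cmd="$1" dest failed=0
    shift
    for dest in "$@"; do
        if ! "$sync_cmd" "$dest"; then
            echo "$(date '+%Y-%m-%d %H:%M:%S') FAILED: $dest" >&2
            failed=$((failed + 1))
        fi
    done
    return "$failed"
}
```

It could be called as mirror_all sync_one root@us.example.com:/var/www/ root@uk.example.com:/var/www/ root@jp.example.com:/var/www/, with a non-zero exit status telling cron (or a monitoring check) that at least one mirror is out of date.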

How Do I Call /root/mirror.dirs As Soon As A New Static File Is Uploaded To /var/www/corporate_lan?

You can use the inotify cron daemon (incron) to monitor filesystem events and execute the /root/mirror.dirs script. Add the following entries to the incrontab:

/var/www/corporate_lan/css/ IN_CLOSE_WRITE,IN_CREATE,IN_DELETE /root/mirror.dirs
/var/www/corporate_lan/images/ IN_CLOSE_WRITE,IN_CREATE,IN_DELETE /root/mirror.dirs
/var/www/corporate_lan/js/ IN_CLOSE_WRITE,IN_CREATE,IN_DELETE /root/mirror.dirs

See how to configure and install inotify under Linux based systems.
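Because the script is started for every matching event, a burst of uploads can start several rsync runs that overlap. The following is a minimal sketch of a guard against that, assuming a lock directory is acceptable; run_locked is a hypothetical helper and the exit code 75 is an arbitrary choice for "skipped".

```bash
#!/bin/bash
# Hypothetical wrapper, not part of the original setup: run a command only
# when no other instance holds the lock. mkdir is atomic, so only one
# caller can create the lock directory at a time.
run_locked() {
    local lockdir="$1" rc
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "another run is in progress, skipping" >&2
        return 75
    fi
    "$@"                # run the wrapped command
    rc=$?
    rmdir "$lockdir"    # release the lock
    return "$rc"
}
```

For example, a small wrapper script could call run_locked /var/lock/mirror.lock /root/mirror.dirs (the lock path is an assumption) and be listed in the incrontab instead of /root/mirror.dirs. A lock directory left behind by a crashed run has to be removed by hand.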

Solution # 2: Mirroring Using unison

Unison synchronizes files in both directions between a server called content.example.com and another server called us.example.com, keeping the same version of the files on multiple servers. Unison allows two replicas of a collection of files and directories to be stored on different servers, modified separately, and then brought up to date by propagating the changes in each replica to the other. In this example, your webmaster can upload new_logo.png to us.example.com and it will be propagated to content.example.com and, when the remaining servers are synced, to the rest of the servers. Similarly, if new_logo.png is deleted from content.example.com, it will be deleted from the rest of the servers. You can use it as follows:
# unison -batch /var/www/corporate_lan ssh://us.example.com//var/www/corporate_lan
To replicate just the /css/ part, enter:
# unison -batch /var/www/corporate_lan/css ssh://us.example.com//var/www/corporate_lan/css
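Instead of repeating the roots on the command line, unison can also read them from a profile file. The following is a minimal sketch, assuming the same paths as above; write_unison_profile is a hypothetical helper that only generates the profile text, and the profile location you pass in is up to you.

```bash
#!/bin/bash
# Hypothetical helper: write a unison profile for one remote host.
# root, path and batch are standard unison preferences; the css/images/js
# selection mirrors the static directories used in this article.
write_unison_profile() {
    local host="$1" out="$2"
    cat > "$out" <<EOF
# Two-way sync between the local tree and ${host}
root = /var/www/corporate_lan
root = ssh://${host}//var/www/corporate_lan
path = css
path = images
path = js
batch = true
EOF
}
```

For example, after write_unison_profile us.example.com ~/.unison/us.prf, running unison us should sync the three directories with us.example.com in batch mode.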

Sample unison server mirroring shell script

Create a shell script called /root/unison.mirror.sh:

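#!/bin/bash
# Two-way sync of the selected directories with each remote server.
_paths="/var/www/corporate_lan/css \
/var/www/corporate_lan/images \
/var/www/corporate_lan/js"
_unison=/usr/bin/unison
_rservers="us.example.com uk.example.com jp.example.com"
for s in ${_rservers}
do
 for p in ${_paths}
 do
  # ${p} starts with /, so the URL becomes ssh://host//absolute/path
  ${_unison} -batch "${p}" "ssh://${s}/${p}"
 done
done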

Run it once an hour using cron, i.e. mirror the servers once an hour:
@hourly /root/unison.mirror.sh
As explained earlier, you can also call this script on demand using the inotify cron daemon:

/var/www/corporate_lan/css/ IN_CLOSE_WRITE,IN_CREATE,IN_DELETE /root/unison.mirror.sh
/var/www/corporate_lan/images/ IN_CLOSE_WRITE,IN_CREATE,IN_DELETE /root/unison.mirror.sh
/var/www/corporate_lan/js/ IN_CLOSE_WRITE,IN_CREATE,IN_DELETE /root/unison.mirror.sh

Solution #3: Use 3rd Party Cloud Based Solution

You can use 3rd party cloud computing infrastructure such as Amazon S3, where your data is automatically replicated across multiple Availability Zones, to keep your data in sync across three different data centers. A discussion of cloud computing is beyond the scope of this FAQ; I recommend reading the documentation for AWS or a similar 3rd party cloud service.

Solution # 4: Replicate Content Using 3rd Party Content Delivery Networks (CDN)

A content delivery network or content distribution network (CDN) is a system of computers containing copies of data, placed at various points in a network so as to maximize bandwidth for access to the data from clients throughout the network. A client accesses a copy of the data near to the client, as opposed to all clients accessing the same central server, so as to avoid a bottleneck near that server. However, a CDN may not be faster than your intranet, which is closer to your users. If you have lots of home workers or remote clients, or your users are spread around the globe, it might be a good idea to use a CDN. Please note that you are putting your static files on someone else's network; if the files are sensitive (not for public view), do not host them on a CDN.

Solution #5: Content Replication Using Data Deduplication

Data deduplication is a specialized data compression technique for eliminating coarse-grained redundant data, typically to improve storage utilization. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored, along with references to the unique copy of data. Deduplication is able to reduce the required storage capacity since only the unique data is stored. You can use open source projects such as opendedup and lessfs. Data deduplication solutions have been designed as filesystems for backup purposes. However, they can also be used as storage for virtual machine images and for replicating your data.

Fig.01: Data deduplication using opendedup

Conclusion

Almost all replication schemes may require low latency. So I recommend that you test all of the above methods and see what works for you.
Usually, open source tools work well when static files are not updated, uploaded and deleted at rapid rates (e.g. 1000s of files per second).
I've also avoided discussing commercial enterprise-grade solutions, such as NFS over WAN using a WAN accelerator (Riverbed, HP EFS WAN accelerator and others), due to cost issues.

REFERENCES
http://www.cyberciti.biz/faq/linux-unix-server-replicating-content-us-europe-asia-webservers/
