Wednesday, August 5, 2009

This case study shows how to make near-instant DNS changes when you move critical web and name servers.

SkyHi @ Wednesday, August 05, 2009

The SOA record controls how fast updated zones propagate from the master to the slave servers, and how long resource records (RRs) are cached in caching servers before they are flushed. Both of these affect your ability to effectuate "instant" changes in the zones you maintain.

Consider the following two scenarios: moving a web server, and moving a DNS server. How quickly these changes need to take effect depends on how critical you consider the services to be. If you run DNS for an e-commerce site, everything is quite likely to be considered very critical, even if the powers that be want everything to be done cheaply. You need to be able to tell these powers that be how things must be done to make the changes work with DNS.

Moving a Service

Let's consider the case of moving a web server from one housing service to another. Depending on how many machines provide the service, you may or may not be faced with the whole service being offline for "a while"; perhaps by moving machines one by one, you will be able to maintain service for the whole moving period. In either case, you want DNS to serve the new address of the service as soon as it is up on the new site. I'll be moving www.penguin.bv. This is an extract of the penguin.bv zone showing the records affected:

$TTL 604800     ; 7 days
;
@       3600    SOA     ns.penguin.bv. hostmaster.penguin.bv. (
                        2000041300      ; serial
                        86400           ; refresh, 24h
                        7200            ; retry, 2h
                        3600000         ; expire, 1000h
                        172800          ; minimum, 2 days
                        )
        NS      ns
        NS      ns.herring.bv.

; WEB, both http://www.penguin.bv/ and
; http://penguin.bv/ with A records

www     A       192.168.55.3
; People often send mail to webmaster@www.domain
        MX      10 mail
        MX      20 mail.herring.bv.
        HINFO   PC Tunes
@       A       192.168.55.3

Because the default TTL for the penguin.bv zone is seven days, I need to start the work more than seven days ahead of time. The first thing to do is to reduce the TTL for the web server A records. The question is, how low do you set them? Remember, all cached RRs will stay in the cache for the duration of the TTL from the moment they are cached. When moving www.penguin.bv, you'll be turning off the machine, dragging it into a car, driving for 10 minutes, and then getting it up on the new site with a new address. This should take a total of 20-30 minutes. The zone will be updated with the new record right before the machine is turned off. So, within 20 minutes after that, you want all the users to have the new address. A TTL between 5 and 10 minutes would seem appropriate (I'll use 10 minutes). So, at least seven days (the old TTL) before the machine is moved, I set these values for the A records that have to do with the web server:

www     600     A       192.168.55.3
...
@       600     A       192.168.55.3

Now they will expire after 10 minutes in the caches. Of course, I changed the serial number as well. And then I reloaded the server and checked the logs—so did you, right?
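
To double-check from the outside, a quick query against the master shows the TTL that caches will now store. This is just a sketch; the output is trimmed, and the exact formatting depends on your dig version:

$ dig @ns.penguin.bv www.penguin.bv. A
;; ANSWER SECTION:
www.penguin.bv.         600     IN      A       192.168.55.3

Any resolver that caches this record now keeps it for 10 minutes at most.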

The second problem is that all the slave servers need to be updated immediately when the update is made, or they will continue serving the old records. Many of your clients will keep getting the old, outdated A records, to their frustration and, not incidentally, yours too.

If you have full access to the slave servers, this is not much of a problem; a simple trick suffices. You can simply log into the slave server, remove the zone file for the updated zone, and restart named. This causes named to immediately request the zone from the master, which solves the problem. If you do not control the slave servers, which is probably much more common, you need to find another way to force the transfer.
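
On a BIND 8 slave, that trick could look like the following sketch; the zone file path is an example and should match the "file" entry in the slave's zone statement:

# On the slave server, as root; the path is an example
rm /var/named/penguin.bv.zone   # remove the slave's local copy of the zone
ndc restart                     # named comes back up without a copy and transfers the zone from the master at once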

Zone Transfer by NOTIFY

The trouble with the NOTIFY request is that it travels by UDP and may be dropped by the network. The "U" in UDP is not for "Unreliable," although it might as well be. Additionally, if a slave does not support NOTIFY, you're out of luck in any case. Also, this is one instance where the very sensible time delay of NOTIFY will be frustrating. You can't know if your server is still waiting out the delay or if the NOTIFY request got lost. Fortunately, though, you can enable more logging in named.conf so you can see everything that happens:

logging {
        channel my_log {
                file "/var/log/named.db";
                severity dynamic;
                print-category yes;
                print-severity yes;
        };
        category notify { my_log; };
        category xfer-out { my_log; };
};

Run "ndc restart" to pick up the configuration change. Update the serial number in the zone, run "ndc trace 3", and then run "tail -f /var/log/named.db" to see what happens when you finally run "ndc reload" to load the updated zone:

29-Apr-2000 16:39:50.897 ns_notify(penguin.bv, IN, SOA):
ni 0x400bf728, zp 0x80e188c, delay 24
29-Apr-2000 16:40:14.899 sysnotify(penguin.bv, IN, SOA)
29-Apr-2000 16:40:14.901 Sent NOTIFY for "penguin.bv IN SOA"
(penguin.bv); 1 NS, 1 A
29-Apr-2000 16:40:14.916 Received NOTIFY answer from 192.168.226.3 for
"penguin.bv IN SOA"
29-Apr-2000 16:40:15.084 zone transfer (AXFR) of "penguin.bv" (IN)
to [192.168.226.3].8478

Actually, there will be more unless you "grep" the tail output, but the preceding contains the interesting bits. First, you see that named decides that a NOTIFY is in order and delays it 24 seconds. Then the time to send the NOTIFY arrives and it is sent. A response is promptly received, and in short order the zone is transferred by the slave. This is what is supposed to happen, for each and every slave server. If it does not, it should suffice to run "ndc restart", because named issues NOTIFY requests when starting, just in case a zone changed since the last reload or restart. In this manner, you should be able to get the slaves reloaded promptly.

Zone Transfer by Other Methods

If not all of your slaves implement NOTIFY and you do not have full access to them, you need to get the slaves to check the zone for updates often enough that the zone transfer happens in time. Controlling this is what the refresh value in the SOA record is for. If you would like the zone transferred within 10 minutes of an update, set the refresh period to 10 minutes. But be sure to do it more than one old refresh interval before the change takes place, so that the new, decreased refresh interval is picked up in time. In cases such as the moving of an important server, a refresh period of one minute would not be out of place. This is quite possibly the simplest way to accomplish the job in any case, if you plan ahead. But do increase the value again afterward.
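
As a sketch, the SOA for the moving period could look like the following; only refresh (and, to be safe, retry) is lowered, and the serial shown is simply an example of the next value:

@ 3600 SOA ns.penguin.bv. hostmaster.penguin.bv. (
                2000041301      ; serial, bumped for this change
                60              ; refresh, 1 minute while the move is in progress
                300             ; retry, 5 minutes, lowered along with refresh
                3600000         ; expire, 1000h
                172800          ; minimum, 2 days
                )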

The last technique is to call up the admins of the slave servers and arrange for them to be available when the move is made. Then phone around, getting the admins to reload the zone by force, by removing the zone file from the slave servers and reloading as described earlier. This works best if there aren't many of them to call.

Moving a Nameserver

Moving a nameserver is probably easier technically, but involves more people than just the nameserver admin. It also involves all the servers that are slaves of it, and all the servers it is a slave for, as well as all those that delegate domains to it. If you count all this, you find that many things depend on that nameserver, both the function it performs and its IP address. Additionally, if the nameserver is used as a resolving, recursive nameserver by someone, all the resolv.conf files need to be fixed.

Actually, it is not as bad as it sounds. Spurious glue records are nowadays avoided, discouraged, and even treated as errors. This means that the number of glue records in need of change is small; it is probably just one, the one in the zone above yours that points to the nameserver inside your domain. If you recall the emperor zone within the penguin.bv zone, it contained these records:

emperor         NS      ns.emperor
                NS      ns.herring.bv.
ns.emperor      A       192.168.56.3

The people delegating name service to your nameserver will have a glue record analogous to that in their zone. In this case, the bv TLD admins will have something like this:

...
penguin         NS      ns.penguin.bv.
penguin         NS      ns.herring.bv.
ns.penguin      A       192.168.55.2
ns.herring      A       192.168.226.3
...

So, before moving a server, you will need to notify the admins of the zone above you. It is quite likely, as is the case for penguin.bv, that this is a TLD registrar, and you have to cope with the registrar's forms, requirements, and processing time. This means that you don't know when the registrar will change their glue record. Most registrars will not change the glue record unless a nameserver is already giving authoritative answers for the zone at the new address, so you cannot simply wait for the record to change and then move the nameserver afterward. In addition, the TTL on a TLD zone is likely to be one day, but your superior zone may not be a TLD; in any case, the TTL may be a week or more. More than a week may pass before the entire world knows that your nameserver has moved. That's a lot of time. Plan ahead.
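
Before filing the registrar's paperwork, it is worth checking that the server at the new address really does answer authoritatively for the zone. A query such as the following (the address 192.168.57.2 is made up for this example, and the output is trimmed) should come back with the aa flag set:

$ dig @192.168.57.2 penguin.bv. SOA
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 2
;; ANSWER SECTION:
penguin.bv.             3600    IN      SOA     ns.penguin.bv. hostmaster.penguin.bv. ...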

Whether you're right under a TLD or you admin a corporate subdomain, a good course of action is to first set up a new server and change the NS records within the zone. Then notify the domain above you of the change; only when the glue record has been changed, has propagated to the slaves, and has expired from caches should you disable the old server. If need be, you can do this by way of a third machine that acts as master temporarily while you move the real host.
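
A sketch of what the penguin.bv zone might look like during such a transition; the name ns2 and its address are made up for the example, and both servers stay listed until the new glue has propagated and the old records have expired from caches:

; during the move, both the old and the new nameserver are listed
@       NS      ns              ; old server
        NS      ns2             ; new server at the new site
        NS      ns.herring.bv.
ns      A       192.168.55.2
ns2     A       192.168.57.2    ; example address at the new site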

Remember, though, that an NS record must point to a domain name, and that domain name must have an A record; it cannot point to a CNAME record. Whenever you move a nameserver, play it straight. Or run quickly to put things right.
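
To illustrate with made-up names, this is the setup to avoid, next to its correct counterpart:

; Wrong: the NS target is a CNAME
@       NS      ns
ns      CNAME   ns2
ns2     A       192.168.57.2

; Right: the NS target owns an A record directly
@       NS      ns2
ns2     A       192.168.57.2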



Reference: http://www.informit.com/articles/article.aspx?p=19792