Saturday, January 16, 2010

Geek to Live: Mastering Wget

SkyHi @ Saturday, January 16, 2010

Your browser does a good job of fetching web documents and displaying them, but there are times when you need an extra-strength download manager to get those tougher HTTP jobs done.

A versatile, old school Unix program called Wget is a highly hackable, handy little tool that can take care of all your downloading needs. Whether you want to mirror an entire web site, automatically download music or movies from a set of favorite weblogs, or transfer huge files painlessly on a slow or intermittent network connection, Wget's for you.

Wget, the "non-interactive network retriever," is called at the command line. The format of a Wget command is:

wget [option]... [URL]...

The URL is the address of the file(s) you want Wget to download. The magic in this little tool is the long menu of options available that make some really neat downloading tasks possible. Here are some examples of what you can do with Wget and a few dashes and letters in the [option] part of the command.

Mirror an entire web site

Say you want to back up your blog or create a local copy of an entire directory of a web site for archiving or reading later. The command:

wget -m http://ginatrapani.googlepages.com

Will save the two pages that exist on the ginatrapani.googlepages.com site in a folder named just that on your computer. The -m in the command stands for "mirror this site."
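If the mirror is meant for offline reading, -m is often combined with a few more options. Here's a minimal sketch of that idea wrapped in a shell function. The name mirror_site is made up for this example, and the extra flags (-k, -p and a one-second wait) are an assumption about what makes a browsable, polite mirror, not part of the command above.

```shell
# mirror_site is a hypothetical helper name, not part of Wget.
# Usage: mirror_site URL
mirror_site() {
    # --mirror            same as -m: recursion plus timestamp checks
    # --convert-links     rewrite links so the local copy browses offline (-k)
    # --page-requisites   also fetch images, stylesheets and so on (-p)
    # --wait=1            pause one second between requests to go easy on the server
    wget --mirror --convert-links --page-requisites --wait=1 "$1"
}

# Example (not run here):
#   mirror_site http://ginatrapani.googlepages.com
```

The function only builds the command, so nothing is downloaded until you call it with a URL.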

Say you want to retrieve all the pages in a site PLUS the pages that site links to. You'd go with:

wget -H -r --level=1 -k -p http://ginatrapani.googlepages.com

This command says, "Download all the pages (-r, recursive) on http://ginatrapani.googlepages.com plus one level (--level=1) into any other sites it links to (-H, span hosts), and convert the links in the downloaded version to point to the other sites' downloaded version (-k). Oh yeah, and get all the components like images that make up each page (-p)."

Warning: Beware, those with small hard drives! This type of command will download a LOT of data from sites that link out a lot (like blogs)! Don't try to back up the Internet, because you'll run out of disk space!
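One way to keep a span-hosts crawl from eating your disk is Wget's --quota option, which applies to recursive downloads like this one. The sketch below is only an illustration: fetch_with_limits is a hypothetical function name, and the 50m default is an arbitrary value picked for the example.

```shell
# fetch_with_limits is a hypothetical helper; the 50m default is an arbitrary example.
# Usage: fetch_with_limits URL [QUOTA]
fetch_with_limits() {
    url=$1
    quota=${2:-50m}   # Wget stops fetching once roughly this much has been downloaded
    wget --span-hosts --recursive --level=1 \
         --convert-links --page-requisites \
         --quota="$quota" --wait=1 "$url"
}

# Example (not run here):
#   fetch_with_limits http://ginatrapani.googlepages.com 100m
```

Wget finishes the file that crosses the quota before stopping, so treat the number as a rough ceiling rather than an exact one.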

Resume large file downloads on a flaky connection

Say you're piggybacking the neighbor's wifi and every time someone microwaves popcorn you lose the connection, and your video download (naughty you!) keeps crapping out halfway through. Direct Wget to resume partial downloads for big files on intermittent connections.

To set Wget to resume an interrupted download of this 16MB "Mavericks Surf Highlights 2006: Wipeouts" short from Google Video, use:

wget -c --output-document=mavericks.avi "http://vp.video.google.com/videodownload?version=0&secureurl=qgAAAJCWpcRd5eI2k3sm3LWJZMjLyLFiTxk_KqUrRYbrzLTEw8hwMV30m3MRz6rYMTxGqWIfWMQjNJsP0fNXUMc34jzoPcy6z-qHde5UVD29Po6_9b_-d3J5AQpVROUPRqzkJriangEl2IMkKBJd08Q7TTJIAC_r6XID-fNYPLKHm1KRvx0smOslivNLGmyZsCsZmVNVN0jaw5-dloWtzPlI86zIubh1XvJsTg2u_YaHcaAB&sigh=-BbV2h_bIFVuVg4D-h6MUTxuErM&begin=0&len=139433&docid=6059494448346363884"

(Apologies for the humungous, non-wrapping URL.)

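The -c ("continue") option tells Wget to pick up a partial download where it left off instead of starting over, so if the transfer is interrupted you can rerun the same command and keep what you already have. You'll also notice the URL is in quotes, which is necessary for any address with &'s in it, because the shell otherwise treats each & as the end of a command. Also, since that URL is so long, you can specify the name of the output file explicitly - in this case, mavericks.avi.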
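For a really flaky connection you may also want Wget to keep retrying on its own. Here's a hedged sketch using the long option names; resume_download is a hypothetical function name, and the retry settings are assumptions chosen for illustration rather than anything from the command above.

```shell
# resume_download is a hypothetical helper name.
# Usage: resume_download URL OUTPUT_FILE
resume_download() {
    # --continue       pick up a partly downloaded file where it left off (-c)
    # --tries=0        retry indefinitely instead of stopping after the default 20 tries
    # --waitretry=10   back off between retries, waiting up to 10 seconds
    # "$1" is quoted so any & in the URL reaches Wget untouched.
    wget --continue --tries=0 --waitretry=10 --output-document="$2" "$1"
}

# Example (not run here):
#   resume_download "http://example.com/video?id=1&len=2" mavericks.avi
```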

Schedule hourly downloads of a file

The nice thing about any command line script is that it's very easy to automate. For instance, if there's a constantly-changing file that you want to download every hour, you can use cron or Windows Task Scheduler and Wget to do just that. Or if there's a very large file you'd rather your computer fetch in the middle of the night while you sleep, instead of right this moment when you need all your bandwidth to get other work done, you can schedule the Wget command to run at a later time.

As proof of concept, yesterday I scheduled an hourly download of Lifehacker's daily traffic chart to run automatically. The command looked like this:

wget --output-document=traffic_$(date +\%Y\%m\%d\%H).gif "http://sm3.sitemeter.com/rpc/v6/server.asp?a=GetChart&n=9&p1=sm3lifehacker&p2=&p3=3&p4=0&p5=64\%2E249\%2E116\%2E138&p6=HTML&p7=1&p8=\%2E\%3Fa\%3Dstatistics&p9=&rnd=7209"

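Notice the use of the %Y, %m, %d and %H datetime parameters (year, month, day and hour), which result in unique filenames, so each hour the command wouldn't overwrite the file with the same name generated the hour before. Note also that the %'s have to be escaped with a backslash when the command lives in a crontab entry, because cron gives an unescaped % a special meaning. If you type the command directly at a shell prompt, leave the backslashes out.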
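One way to sidestep the escaping entirely is to put the command in a small script and have cron call the script. The following is a sketch under that assumption; hourly_name and fetch_hourly are hypothetical names, and the script path in the crontab comment is a placeholder.

```shell
# hourly_name and fetch_hourly are hypothetical helper names.

# Print a filename such as traffic_2010011614.gif (year, month, day, hour).
hourly_name() {
    printf 'traffic_%s.gif\n' "$(date +%Y%m%d%H)"
}

# Usage: fetch_hourly URL
fetch_hourly() {
    wget --output-document="$(hourly_name)" "$1"
}

# A script that defines these functions and then calls fetch_hourly with the
# chart URL could be scheduled at minute 0 of every hour with a crontab entry
# like this one (the path is a placeholder):
#   0 * * * * /path/to/fetch_hourly.sh
```

Because the % signs now live inside the script rather than in the crontab line, they need no backslashes.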

Just for fun I threw together a little animated GIF of the hourly chart image that displays the movement of Lifehacker's traffic yesterday from 2PM to midnight.

Automatically download music

This last technique, suggested by Jeff Veen, is by far my favorite use of Wget. These days there are tons of directories, aggregators, filters and weblogs that point off to interesting types of media. Using Wget, you can create a text file list of your favorite sites that, say, link to MP3 files, and schedule it to automatically download any newly-added MP3s from those sites each day or week.

First, create a text file called mp3_sites.txt, and list URLs of your favorite sources of music online one per line (like http://del.icio.us/tag/system:filetype:mp3 or stereogum.com). Be sure to check out my previous feature on how to find free music on the web for more ideas.
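If you'd rather build that list from the command line, something like the sketch below would do it. write_site_list is a hypothetical function name, and the http:// prefix on stereogum.com is added here as an assumption so both lines are full URLs.

```shell
# write_site_list is a hypothetical helper; pass a filename to write somewhere else.
# Usage: write_site_list [FILE]
write_site_list() {
    list=${1:-mp3_sites.txt}
    # One URL per line; these two are the example sources mentioned above.
    printf '%s\n' \
        'http://del.icio.us/tag/system:filetype:mp3' \
        'http://stereogum.com' > "$list"
}

# Example:
#   write_site_list            # creates mp3_sites.txt in the current directory
```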

Then use the following Wget command to go out and fetch those MP3s:

wget -r -l1 -H -t1 -nd -N -np -A.mp3 -erobots=off -i mp3_sites.txt

That Wget recipe recursively downloads only MP3 files linked from the sites listed in mp3_sites.txt that are newer than any you've already downloaded. There are a few other specifications in there, such as not creating a new directory for every music file, ignoring robots.txt, and not crawling up to the parent directory of a link. Jeff breaks it all down in his original post.
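To make the recipe easier to read, here is the same set of options spelled out with Wget's long option names. This is a sketch, not Jeff's version: fetch_mp3s is a hypothetical function name, and the default list filename is just the one used above.

```shell
# fetch_mp3s is a hypothetical helper name.
# Usage: fetch_mp3s [LIST_FILE]
#
# --recursive --level=1   follow links, but only one level deep (-r -l1)
# --span-hosts            allow downloads from other sites the pages link to (-H)
# --tries=1               try each file once, then move on (-t1)
# --no-directories        save everything into the current directory (-nd)
# --timestamping          skip files that aren't newer than the local copy (-N)
# --no-parent             never climb to a link's parent directory (-np)
# --accept=.mp3           keep only files ending in .mp3 (-A.mp3)
# --execute robots=off    ignore robots.txt (-erobots=off)
# --input-file=FILE       read the starting URLs from FILE (-i)
fetch_mp3s() {
    wget --recursive --level=1 \
         --span-hosts \
         --tries=1 \
         --no-directories \
         --timestamping \
         --no-parent \
         --accept=.mp3 \
         --execute robots=off \
         --input-file="${1:-mp3_sites.txt}"
}

# Example (not run here):
#   fetch_mp3s mp3_sites.txt
```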

The great thing about this technique is that once this command is scheduled, you get an ever-rotating jukebox of new music Wget fetches for you while you sleep. With a good set of trusted sources, you'll never have to go looking for new music again - Wget will do all the work for you.

Install Wget

Wanna give all this a try? Windows users, you can download Wget here; Mac users, go here. An alternative for Windows users interested in more Linuxy goodness is to download and install the Unix emulator Cygwin which includes Wget and a whole slew of other 'nixy utilities, too.
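Not sure whether Wget is already on your machine? On a Unix-style shell (including Cygwin) a quick check might look like the sketch below. check_tool is a hypothetical function name made up for this example.

```shell
# check_tool is a hypothetical helper name.
# Usage: check_tool NAME   (prints "NAME: found" or "NAME: missing")
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# Example:
#   check_tool wget && wget --version | head -n 1
```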

For the full take on all of Wget's secret options sauce, type wget --help or check out the full-on Wget manual online. No matter what your downloading task may be, some combination of Wget's extensive options will get the job done just right.


Reference: http://lifehacker.com/161202/geek-to-live--mastering-wget