There are a couple of security auditing frameworks out there, and the temptation to create your own is high; whether in Perl, Ruby, Python or, why not, PHP as well.
Needless to say, I too was tempted to create my own framework. Ideas kept flowing in, the project got started and then BAM, I read an interesting article on GNUCITIZEN, which made me rethink my strategy…
One of the comments put it very well:
most of the stuff we need is on the shell already. pentesting frameworks is like the new security-testing hype. first we had hundreds of portscanners, then hundreds of webapp MiTM proxies, then hundreds of fuzzers, then hundreds of SQL injectors, now it’s about pentesting frameworks :)
So instead of starting to write redundant code, I started learning the command-line tools that are already available, which have years of development behind them and cover almost every need.
Basically I'm building my framework around already available tools, and only coding up things that don't exist, or that handle some very particular cases.
So why wget?
Well, I had to start my series of articles with something (it's gonna be a series), and wget seemed to be a good starting point.
If you've never dealt with wget (which I sincerely doubt), the following description sums it up best:
GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc
Without further useless rambling, let's see in which scenarios you would use wget; apart from downloading psyBNC archives, as seen on many h4×00r websites.
Website crawling
There are a couple of tools that facilitate website crawling; I even mentioned one in my Intercepting Proxies? article, with the difference that LiveHTTPHeaders may be used for passive crawling of websites…
So how would we go about crawling a website with wget?
wget -r -nd --spider -o links.txt http://insanesecurity.info
Where:
- -r – recursive crawling
- -nd – don't create directories
- --spider – don't save the pages; just visit them and collect links
- -o links.txt – save the log output to the file links.txt
But what if we want to restrict the crawling to a single directory and filter out CSS, image and JavaScript files?
wget -r -nd --spider -o links.txt -np -R js,css,jpg,png,gif http://insanesecurity.info/blog/
Where:
- -np – do not ascend to the parent directory
- -R – comma-separated list of extensions to reject
After the command finishes, the content of links.txt would look like this (assuming you've run the first command):
Spider mode enabled. Check if remote file exists.
--2009-07-07 16:46:58--  http://insanesecurity.info/
Resolving insanesecurity.info... 93.115.201.3
Connecting to insanesecurity.info|93.115.201.3|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://insanesecurity.info/blog/ [following]
Spider mode enabled. Check if remote file exists.
--2009-07-07 16:47:01--  http://insanesecurity.info/blog/
Connecting to insanesecurity.info|93.115.201.3|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Remote file exists and could contain links to other resources -- retrieving.

--2009-07-07 16:47:01--  http://insanesecurity.info/blog/
Connecting to insanesecurity.info|93.115.201.3|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `index.html'
From this point, retrieving the links is just a matter of using grep, cut, sort and uniq.
cat links.txt | grep -P "\-\-\d{4}" | cut -d " " -f 4 | sort | uniq
And the output would look like this:
http://insanesecurity.info/
http://insanesecurity.info/2009/01/hacking-yahoogmailhotmail-accounts-a-z-guide/
http://insanesecurity.info/2009/01/javascript-userscript-keylogger/
http://insanesecurity.info/2009/01/logging-the-http-requests/
http://insanesecurity.info/2009/01/password-insecurity-wordlists-dictionaries/
http://insanesecurity.info/2009/01/the-future-of-av-or-not/
http://insanesecurity.info/2009/01/the-hackers-underground-handbook-review/
http://insanesecurity.info/2009/01/useratuh-frontend-to-backend-encryption/
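If you'd rather skip the intermediate log file, the crawl and the extraction can be glued together in one go. Wget writes its log to stderr when -o isn't given, so a sketch along these lines should do the same job (sort -u stands in for sort | uniq):

# One-liner sketch: crawl and pull the URLs straight out of the log
wget -r -nd --spider -np -R js,css,jpg,png,gif http://insanesecurity.info/blog/ 2>&1 \
  | grep -P "^--\d{4}" \
  | cut -d " " -f 4 \
  | sort -u > links.txt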
Copying websites
Or website mirroring, as people usually call it.
There are a couple of reasons why you would do this:
- To have a copy you can carry around on a CD/DVD/memory card/etc., with the possibility of converting the links to point to local files.
- Content to feed email scrapers
For our first scenario, you would run wget in the following manner:
wget -m -k -p -np http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/
Where:
- -m – mirror the website
- -k – convert HTML links to point to the local copies
- -p – get page requisites (CSS, images)
- -np – do not ascend to parent folders
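A quick aside before moving on: if you don't want the mirror to hammer the target (or to stand out in its logs), wget already has the knobs for that. A variation on the command above, assuming you want a delay between requests, a bandwidth cap and a browser-like User-Agent, could look like this sketch:

# Mirror politely/quietly: wait between requests, cap bandwidth, fake the User-Agent
wget -m -k -p -np -w 2 --random-wait --limit-rate=50k \
  -U "Mozilla/5.0 (Windows; U; Windows NT 5.1)" \
  http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/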
And as far as using it for scraping purposes goes, the simplest of commands will do:
wget http://tinyurl.com/nqa48q
cat downloaded-file | grep -o -P "\w+\[at\]\w+" > emails.txt
And this way I gathered a list of 40 email addresses… here's a sample of them:
ureachnirav[at]yahoo
reachjag[at]yahoo
g2[at]g2designindia
g2design[at]rediffmail
archsumitjoshi[at]yahoo
joshi[at]hexagon
rohankarswapnil[at]yahoo
sandy004[at]yahoo
idc[at]iitb
visakan[at]gmail
Of course, to replace the [at] in them, we can simply use sed:
cat emails.txt | sed -r "s/\[at\]/@/g" > normal-emails.txt
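For the record, the download, the grep and the sed can be collapsed into a single pipeline with no intermediate files; a sketch, using -O - to send the page to stdout and -q to silence the log:

# Scrape and normalize in one pass
wget -q -O - http://tinyurl.com/nqa48q \
  | grep -o -P "\w+\[at\]\w+" \
  | sed -r "s/\[at\]/@/g" \
  | sort -u > normal-emails.txt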
Blog spam
This is also possible (and simple) to achieve with wget and minor interference from grep and sed, but for this one I will not post an example… There are already hundreds of spammers out there, so why add more to the list? If you're interested enough in wget, you'll figure out how on your own…
Of course, for this you will need to script the behavior (bash, bat, Perl, Python, etc.), or craft a lengthy command line.
FTP Copy
As mentioned at the beginning of the article, wget works just as well over the FTP protocol.
wget -r --ftp-user=anonymous --ftp-password=some@email.com ftp://ftp.ro.freebsd.org/pub/FreeBSD/
I don't think the command-line arguments need explaining here; they are pretty obvious.
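If you only want certain file types out of the FTP tree (and don't want to saturate your link while at it), the accept list and the rate limiter come in handy; a sketch, with the file pattern picked purely for illustration:

# Grab only .txt files from the FTP tree, capped at 200 KB/s
wget -r -np -A "*.txt" --limit-rate=200k \
  --ftp-user=anonymous --ftp-password=some@email.com \
  ftp://ftp.ro.freebsd.org/pub/FreeBSD/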
Anonymous mode
By anonymous mode I mean channeling wget requests through proxy servers. First you need to configure your .wgetrc file. If you haven't got one, you may as well create it now.
#############################
###
### Sample Wget initialization file .wgetrc
###

## You can use this file to change the default behaviour of wget or to
## avoid having to type many many command-line options. This file does
## not contain a comprehensive list of commands -- look at the manual
## to find out what you can put into this file.
##
## Wget initialization file can reside in /usr/local/etc/wgetrc
## (global, for all users) or $HOME/.wgetrc (for a single user).
##
## To use the settings in this file, you will have to uncomment them,
## as well as change them, in most cases, as the values on the
## commented-out lines are the default values (e.g. "off").

##
## Global settings (useful for setting up in /usr/local/etc/wgetrc).
## Think well before you change them, since they may reduce wget's
## functionality, and make it behave contrary to the documentation:
##

# You can set retrieve quota for beginners by specifying a value
# optionally followed by 'K' (kilobytes) or 'M' (megabytes). The
# default quota is unlimited.
#quota = inf

# You can lower (or raise) the default number of retries when
# downloading a file (default is 20).
#tries = 20

# Lowering the maximum depth of the recursive retrieval is handy to
# prevent newbies from going too "deep" when they unwittingly start
# the recursive retrieval. The default is 5.
#reclevel = 5

# Many sites are behind firewalls that do not allow initiation of
# connections from the outside. On these sites you have to use the
# `passive' feature of FTP. If you are behind such a firewall, you
# can turn this on to make Wget use passive FTP by default.
#passive_ftp = off

# The "wait" command below makes Wget wait between every connection.
# If, instead, you want Wget to wait only between retries of failed
# downloads, set waitretry to maximum number of seconds to wait (Wget
# will use "linear backoff", waiting 1 second after the first failure
# on a file, 2 seconds after the second failure, etc. up to this max).
waitretry = 10

##
## Local settings (for a user to set in his $HOME/.wgetrc). It is
## *highly* undesirable to put these settings in the global file, since
## they are potentially dangerous to "normal" users.
##
## Even when setting up your own ~/.wgetrc, you should know what you
## are doing before doing so.
##

# Set this to on to use timestamping by default:
#timestamping = off

# It is a good idea to make Wget send your email address in a `From:'
# header with your request (so that server administrators can contact
# you in case of errors). Wget does *not* send `From:' by default.
#header = From: Your Name

# You can set up other headers, like Accept-Language. Accept-Language
# is *not* sent by default.
#header = Accept-Language: en

# You can set the default proxies for Wget to use for http and ftp.
# They will override the value in the environment.
http_proxy = http://1.2.3.4:8080/
#ftp_proxy = http://proxy.yoyodyne.com:18023/

# If you do not want to use proxy at all, set this to off.
use_proxy = on

# You can customize the retrieval outlook. Valid options are default,
# binary, mega and micro.
#dot_style = default

# Setting this to off makes Wget not download /robots.txt. Be sure to
# know *exactly* what /robots.txt is and how it is used before changing
# the default!
#robots = on

# It can be useful to make Wget wait between connections. Set this to
# the number of seconds you want Wget to wait.
#wait = 0

# You can force creating directory structure, even if a single is being
# retrieved, by setting this to on.
#dirstruct = off

# You can turn on recursive retrieving by default (don't do this if
# you are not sure you know what it means) by setting this to on.
#recursive = off

# To always back up file X as X.orig before converting its links (due
# to -k / --convert-links / convert_links = on having been specified),
# set this variable to on:
#backup_converted = off

# To have Wget follow FTP links from HTML files by default, set this
# to on:
#follow_ftp = off
I saved the file in my C:/Windows folder, but you may save it anywhere else you like. Under Linux you may already have this file in your /etc folder, so just modify it there.
As you may have noticed in the configuration file above (if you've looked closely), I have uncommented http_proxy and use_proxy, pointing wget at a proxy server.
set WGETRC=C:/Windows/.wgetrc
wget --proxy=on http://insanesecurity.info/blog/
The first line is necessary under Windows if you haven't set the custom .wgetrc location until now, while the second command enables the proxy and executes a request.
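If editing .wgetrc feels like overkill for a one-off request, the same settings can be passed per invocation; a sketch using -e (which executes a .wgetrc-style command) or, on Linux/BSD shells, the http_proxy environment variable:

# Per-invocation proxy, no .wgetrc needed
wget -e use_proxy=on -e http_proxy=http://1.2.3.4:8080/ http://insanesecurity.info/blog/

# or via the environment (Linux/BSD)
export http_proxy=http://1.2.3.4:8080/
wget http://insanesecurity.info/blog/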
Tweaking it
This was just a quick intro to the most common usages of wget… Besides the ones mentioned here, it also comes with a handful of other configuration options which you may look into…
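The quickest way to browse those options, if you're curious:

wget --help | less
man wget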
Hopefully this article will be read by those who write scrapers and spiders all day long… I'm tired of constantly bumping into those kinds of scripts :)
Reference: http://insanesecurity.info/blog/wget-all-the-way