One-Liners for Apache Log Files

SkyHi @ Tuesday, November 08, 2011


Apache One-Liners
I frequently need to look at Apache log files to diagnose problems. Over time I've developed a series of one-liners I can copy and paste to quickly analyze a log file and look for problems, abuse, popular pages, etc.
If someone is reporting a slow site, it can be useful to see whether one IP is accessing URLs much more than the others, since that can indicate a poorly written crawler using up lots of resources. Other times a slow site is simply getting a lot of traffic, so it helps to look at the top referrers to see where the links are coming from, or to look at the most popular URLs and cache those pages.
The one-liners are usually just a first step in diagnosing the problem. For example, I might only want to look at a certain time, so instead of using tail on the transfer log I'll use fgrep '2011:05:' ./transfer.log to look at what happened between 5:00 AM and 5:59 AM. Or maybe I want to see what one IP was doing, so I'll grep that IP out, look at its top 20 URLs, and then perhaps narrow it down further to only the requests that were POST instead of GET. If you don't know awk, it is worth learning since it is great for this kind of work; otherwise you can chain greps to get the data you want. Here's the above example using both a chain of greps and one of the one-liners below:
fgrep '2011:05:' ./transfer.log | fgrep '1.2.3.4' | grep 'POST' | awk '{freq[$7]++} END {for (x in freq) {print freq[x], x}}' | sort -rn | head -20
If the log file is large or you want to get fancy, you can do it in awk instead:
awk '$4 ~ /2011:05:/ && $1 ~ /1\.2\.3\.4/ && $6 ~ /POST/ {freq[$7]++} END {for (x in freq) {print freq[x], x}}' ./transfer.log | sort -rn | head -20
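If you find yourself repeating this, it can be wrapped in a small shell function. This is just a sketch, not from the original list: the function name and argument order are mine, and it assumes the standard combined log format where the request path is field 7.

# usage: top_urls <logfile> <hour pattern> <ip pattern> <method>  -- hypothetical helper, names are mine
top_urls() {
  awk -v hour="$2" -v ip="$3" -v method="$4" \
    '$4 ~ hour && $1 ~ ip && $6 ~ method {freq[$7]++} END {for (x in freq) print freq[x], x}' "$1" \
    | sort -rn | head -20
}
top_urls ./transfer.log '2011:05:' '1.2.3.4' 'POST'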
The Art of Web and Wicked Cool Shell Scripts both cover some of this, although with their method on large log files you can end up piping far too much data to sort | uniq -c, which awk can handle more efficiently. These one-liners show both methods of getting the information. If you're interested in seeing the effect of piping too much data, run some of them against your biggest log file and see how much faster they are when the sort | uniq -c is moved into awk.
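One quick way to check that yourself (a sketch, not from the original list) is to time both variants against the same file; the counts come out the same, but the second version never has to sort every raw line:

# compare the sort | uniq -c pipeline against the awk associative-array version
time awk '{print $7}' ./transfer.log | sort | uniq -c | sort -rn | head -20
time awk '{freq[$7]++} END {for (x in freq) print freq[x], x}' ./transfer.log | sort -rn | head -20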
# top 20 URLs from the last 5000 hits
tail -5000 ./transfer.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20
tail -5000 ./transfer.log | awk '{freq[$7]++} END {for (x in freq) {print freq[x], x}}' | sort -rn | head -20
 
# top 20 URLs, excluding query strings, from the last 5000 hits
tail -5000 ./transfer.log | awk -F"[ ?]" '{print $7}' | sort | uniq -c | sort -rn | head -20
tail -5000 ./transfer.log | awk -F"[ ?]" '{freq[$7]++} END {for (x in freq) {print freq[x], x}}' | sort -rn | head -20
 
# top 20 IPs from the last 5000 hits
tail -5000 ./transfer.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -20
tail -5000 ./transfer.log | awk '{freq[$1]++} END {for (x in freq) {print freq[x], x}}' | sort -rn | head -20
 
# top 20 URLs requested from a certain IP from the last 5000 hits
IP=1.2.3.4; tail -5000 ./transfer.log | grep $IP | awk '{print $7}' | sort | uniq -c | sort -rn | head -20
IP=1.2.3.4; tail -5000 ./transfer.log | awk -v ip=$IP ' $1 ~ ip {freq[$7]++} END {for (x in freq) {print freq[x], x}}' | sort -rn | head -20
 
# top 20 URLs requested from a certain IP, excluding query strings, from the last 5000 hits
IP=1.2.3.4; tail -5000 ./transfer.log | fgrep $IP | awk -F "[ ?]" '{print $7}' | sort | uniq -c | sort -rn | head -20
IP=1.2.3.4; tail -5000 ./transfer.log | awk -F"[ ?]" -v ip=$IP ' $1 ~ ip {freq[$7]++} END {for (x in freq) {print freq[x], x}}' | sort -rn | head -20
 
# top 20 referrers from the last 5000 hits
tail -5000 ./transfer.log | awk '{print $11}' | tr -d '"' | sort | uniq -c | sort -rn | head -20
tail -5000 ./transfer.log | awk '{freq[$11]++} END {for (x in freq) {print freq[x], x}}' | tr -d '"' | sort -rn | head -20
 
# top 20 user agents from the last 5000 hits
tail -5000 ./transfer.log | cut -d\  -f12- | sort | uniq -c | sort -rn | head -20
 
# sum of data (in MB) transferred in the last 5000 hits
tail -5000 ./transfer.log | awk '{sum+=$10} END {print sum/1048576}'
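These patterns combine easily. For example, here is a sketch (not from the original list) of the top 20 IPs by data transferred in the last 5000 hits, assuming the response size is in field 10 as above:

# top 20 IPs by MB transferred in the last 5000 hits
tail -5000 ./transfer.log | awk '{mb[$1]+=$10} END {for (ip in mb) print mb[ip]/1048576, ip}' | sort -rn | head -20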