Find an IP address and tons more in Apache logs with Regular Expressions and egrep or grep
I wanted to look through my Apache logs over SSH and pull out a list of IPs. I am kinda new to regular expressions so it took me 3 tries, but I ended up with this big string of junk to do just that. The -o flag makes egrep spit out only the matching part of each line (the IP), while the rest of the pattern grabs four groups of digits separated by three periods. The single quotes matter: without them the shell strips the backslashes before egrep ever sees them.
egrep -o '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' access.log.0
the ^ means beginning of line
[0-9] means any single digit, 0 through 9
the + means one or more times in a row
the \ means take the next character literally, so \. matches an actual period instead of any character
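Here is a small sketch of the pattern at work, run against two made-up log lines (the addresses come from the documentation ranges, not from my logs). I'm using grep -E, which behaves the same as egrep. The second half shows what I assume happens when the backslashes go missing: the bare . matches any character, so junk that isn't an IP gets through.

```shell
# Two made-up access log lines; the IPs are documentation-range examples.
sample='192.0.2.10 - - [01/Jan/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 512
198.51.100.7 - - [01/Jan/2024:00:00:02 +0000] "GET /about HTTP/1.1" 200 1024'

# Single quotes keep the shell from stripping the backslashes.
ips=$(printf '%s\n' "$sample" | grep -Eo '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+')
printf '%s\n' "$ips"

# Same pattern with the backslashes lost: "." now matches any character,
# so a string that is clearly not an IP still matches.
loose=$(printf '%s\n' '1a2b3c4' | grep -Eo '^[0-9]+.[0-9]+.[0-9]+.[0-9]+')
printf '%s\n' "$loose"
```

The first command prints the two IPs, one per line. The second prints 1a2b3c4, which is exactly the kind of false match the escaping is there to prevent.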
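Then I went a step further and wanted to figure out how many UNIQUE visitors I had in a particular access log, and how many hits each one made.
egrep -o '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' access.log.0 | sort | uniq -c | sort -n
Note: uniq only collapses duplicate lines that sit right next to each other, which is why the first sort is there; without it uniq seemed to leave duplicates. I also tried uniq -u -c, but -u prints only the lines that are never repeated, which is not what I wanted. Sorting first is a little slower, but reliable.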
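To make the difference between those uniq flags concrete, here is a minimal sketch on a made-up list of IPs (one line per request, documentation-range addresses). The variable names are just for illustration.

```shell
# Made-up list of client IPs, one per request.
ips='192.0.2.10
198.51.100.7
192.0.2.10
203.0.113.5
192.0.2.10'

# Hits per IP, busiest first. The first sort puts duplicates next to
# each other so that uniq -c can count them.
per_ip=$(printf '%s\n' "$ips" | sort | uniq -c | sort -rn)
printf '%s\n' "$per_ip"

# Number of distinct IPs (tr strips the padding some wc versions add).
unique_count=$(printf '%s\n' "$ips" | sort -u | wc -l | tr -d ' ')
echo "$unique_count"

# uniq -u keeps only the IPs that appear exactly once.
seen_once=$(printf '%s\n' "$ips" | sort | uniq -u)
printf '%s\n' "$seen_once"
```

On this sample the busiest IP is 192.0.2.10 with 3 hits, there are 3 distinct IPs, and uniq -u leaves only the two addresses that showed up a single time.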
Okay so now the NEXT step! Let's do a reverse lookup on all those IPs to see some hostnames... why not?!?!?! We're going crazy here.
for ip in $(egrep -o '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' access.log.0 | sort | uniq); do dig -x "$ip" | grep PTR | egrep -v '^;' >> hosts.txt; done
The for loop above looks up the PTR (reverse DNS) record for every unique IP in the entire log file and appends the results to hosts.txt. I had to do a little hackery and exclude lines starting with ; because I was getting duplicates: dig echoes the question back as a comment line that also contains PTR. It worked, though. Phew, that command took a while. I hope my shared web host doesn't mind about 4,000 reverse DNS lookups.
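A slightly tidier variation on that loop, as a rough sketch rather than what I actually ran: dig's +short option prints just the answer, so the grep hackery goes away. The function names dig_ptr and resolve_all are made up for this example, and the resolver is passed in as a parameter so the loop can be tried without touching the network.

```shell
# Default resolver: ask dig for just the PTR target (needs network access).
dig_ptr() {
    dig +short -x "$1" | head -n 1
}

# Hypothetical helper: read IPs on stdin, print "ip hostname" per line.
# $1 optionally names a different resolver command, handy for dry runs.
resolve_all() {
    resolver=${1:-dig_ptr}
    while read -r ip; do
        [ -n "$ip" ] || continue
        # </dev/null stops the resolver from swallowing the rest of the IP list.
        host=$("$resolver" "$ip" </dev/null)
        printf '%s %s\n' "$ip" "${host:-unknown}"
    done
}

# Usage (commented out because it hits the network):
# grep -Eo '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' access.log.0 | sort -u | resolve_all > hosts.txt
```

The output format differs from the dig lines in my hosts.txt (it is just the IP followed by the hostname), but each line still ends with the hostname and its trailing dot, so a grep anchored on the end of the line keeps working.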
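So what will he do next?! Well, to avoid running that big command ever again, I dumped the results to hosts.txt. Try doing a grep on that for .gov, or something like that. Hostnames in dig's output end with a trailing dot, so the pattern is anchored on .gov. at the end of the line, and it is quoted so the shell leaves the backslashes alone. That would look like this:
grep '\.gov\.$' hosts.txt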
That should roughly be all the government addresses that hit my web server.
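Then you could count how many distinct ones that is, too (sort -u instead of a bare uniq, since uniq only drops duplicates that are adjacent):
grep '\.gov\.$' hosts.txt | sort -u | wc -l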
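Here is a small sketch of that count on hand-written lines in the general shape of dig's answer section. The hostnames, TTLs and addresses are invented for the example, not pulled from my hosts.txt.

```shell
# Hand-written sample in the shape of dig PTR answers (all values invented).
hosts='10.2.0.192.in-addr.arpa. 3600 IN PTR www.agency.example.gov.
7.100.51.198.in-addr.arpa. 3600 IN PTR crawler.example.com.
10.2.0.192.in-addr.arpa. 3600 IN PTR www.agency.example.gov.
5.113.0.203.in-addr.arpa. 3600 IN PTR gov.example.net.'

# Keep lines whose hostname ends in ".gov." and count the distinct ones.
gov_count=$(printf '%s\n' "$hosts" | grep '\.gov\.$' | sort -u | wc -l | tr -d ' ')
echo "$gov_count"
```

Only the two identical www.agency.example.gov. lines match, and sort -u folds them into one, so the count comes out as 1. The gov.example.net. line is skipped because the pattern only matches .gov. at the very end of the line.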