Operating System - HP-UX

Re: Urgent Question about grepping thru the logs

 
SOLVED
Allanm
Super Advisor

Urgent Question about grepping thru the logs


I have a list of fraudulent IPs (~2000) that I need to search for in my Apache web logs. I have the logs (~450 files) from all my web servers in one place, covering the last 3 months. What would be the best way to grep for those IPs in the gzipped logs?

Please help!

Thanks,
Allan
14 REPLIES
Allanm
Super Advisor

Re: Urgent Question about grepping thru the logs

Please Help!
James R. Ferguson
Acclaimed Contributor
Solution

Re: Urgent Question about grepping thru the logs

Hi Allan:

You might create a file of your IP addresses -- one per line, called '/tmp/IPS' -- and then do:

#!/usr/bin/sh
cd /path_to_logs
for FILE in *
do
    echo ">>> '${FILE}' <<<"
    gzcat "${FILE}" | grep -f /tmp/IPS
done

Regards!

...JRF...
Allanm
Super Advisor

Re: Urgent Question about grepping thru the logs

That is working, but it's going very, very slowly.

Any way to speed it up?

Thanks,
Allan.
Michael Steele_2
Honored Contributor

Re: Urgent Question about grepping thru the logs

Hi

The gzcat step is probably consuming most of your CPU time by decompressing each file. If you've got the room, you can speed things up by storing the logs uncompressed.
Allanm
Super Advisor

Re: Urgent Question about grepping thru the logs

I'm afraid space is at a premium with these huge log files. What I'm planning is to split the IP file into 6 pieces and then run the script 6 times in parallel. Let me know if you have any more good ideas.

Thanks,
Allan
Hein van den Heuvel
Honored Contributor

Re: Urgent Question about grepping thru the logs

I don't know how fast "grep -f" works, never having benchmarked it.
Given the number of patterns, you may want to help it a little, if you can, by not listing just the bare IPs, but ANCHORING them to the beginning of the line, ^aa.bb.cc.dd, to allow a quicker yes/no decision
(if appropriate... you did not share the log layout).
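For example, something roughly like this would anchor and dot-escape the list (the paths and the demo data are hypothetical, and it assumes the client IP is the first thing on each log line, as in the Apache common log format):

```shell
# Hypothetical demo: /tmp/IPS holds one IP per line. Escape the dots
# and anchor each pattern to the start of the line, so '1.2.3.4'
# becomes '^1\.2\.3\.4' and can no longer match '91.2.3.4'.
printf '1.2.3.4\n10.0.0.9\n' > /tmp/IPS
sed -e 's/\./\\./g' -e 's/^/^/' /tmp/IPS > /tmp/IPS.anchored

# Demo log: only the first line starts with a listed IP.
printf '1.2.3.4 - - "GET /"\n91.2.3.4 - - "GET /"\n' > /tmp/LOG
grep -f /tmp/IPS.anchored /tmp/LOG
```

Anchored patterns let grep reject most lines after comparing a few characters, and escaping the dots stops '.' from matching any character.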

As expressed earlier, it is quite possibly the gzcat that consumes the most resources. You really should verify that (with top?).

If 'grep' is the top consumer, then consider rewriting in awk or Perl: initially load those 2000 IPs into an associative array, then read the log, extract the IP, and look it up in the array.

Something roughly like:
$ cat > IP.tmp
1.2.3.4
2.3.4.5
4.5.6.7
$ cat > LOG.tmp
aap 5.6.7.8
noot 1.2.3.4
mies 4.3.2.1
$ awk 'BEGIN {while ((getline ip < "IP.tmp") > 0) {ips[ip]=1}} $2 in ips' LOG.tmp
noot 1.2.3.4
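To run that over the gzipped logs, something roughly like the loop below would do it (gzip -dc is the portable spelling of gzcat; the /tmp paths, the demo data, and the $2 field position are all assumptions about your layout):

```shell
# Hypothetical demo files standing in for the real IP list and logs.
printf '1.2.3.4\n2.3.4.5\n' > /tmp/IPS
mkdir -p /tmp/logs
printf 'noot 1.2.3.4\naap 5.6.7.8\n' | gzip > /tmp/logs/demo.log.gz

cd /tmp/logs
for FILE in *.gz
do
    echo ">>> ${FILE} <<<"
    # load the IP list into an associative array once, then print
    # only the lines whose second field is a listed IP
    gzip -dc "${FILE}" |
        awk 'BEGIN {while ((getline ip < "/tmp/IPS") > 0) ips[ip]=1} $2 in ips'
done
```

This way each log line costs one hash lookup instead of a scan through 2000 patterns.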

Good luck!
Hein van den Heuvel
HvdH Performance Consulting
Patrick Wallek
Honored Contributor

Re: Urgent Question about grepping thru the logs

>>run the script 6 times in parallel

How many processors are in your server? If you have fewer than 6, you may do more harm than good.

I would run one script fewer than the number of processors in the system (4 processors -- 3 scripts running).
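That plan can be sketched with split(1) and background jobs; the demo files below stand in for the real IP list and logs, and N would be sized to your CPU count as above:

```shell
# Hypothetical demo data standing in for the real IP list and logs.
printf '1.2.3.4\n5.6.7.8\n10.0.0.9\n' > /tmp/IPS
mkdir -p /tmp/plogs
printf 'noot 1.2.3.4 GET /\naap 9.9.9.9 GET /\n' | gzip > /tmp/plogs/a.log.gz

N=3
rm -f /tmp/IPS.part.*                   # clear chunks from earlier runs
# split the IP list into N roughly equal chunks: IPS.part.aa, .ab, ...
split -l $(( ($(wc -l < /tmp/IPS) + N - 1) / N )) /tmp/IPS /tmp/IPS.part.

for PART in /tmp/IPS.part.*
do
    # each chunk scans every log, so the decompression cost is paid N times
    gzip -dc /tmp/plogs/*.gz | grep -f "$PART" > "${PART}.hits" &
done
wait                                    # let every background grep finish
cat /tmp/IPS.part.*.hits
```

Note the trade-off: each parallel job re-decompresses all the logs, so this only wins when grep, not gzcat, is the bottleneck.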
Steven Schweda
Honored Contributor

Re: Urgent Question about grepping thru the logs

> If 'grep' is the top consumer than consider
> re-writting in AWK or PERL [...]

Sometimes it pays to write a real computer program in a real, compiled programming language. C, for example, is popular these days. Or so I hear. (I think that it even has arrays.)

Sorry if this sounds too radical.
Hein van den Heuvel
Honored Contributor

Re: Urgent Question about grepping thru the logs

>> Sometimes it pays to write a real computer program in a real, compiled programming language.

:-)

Yes. And hashed lookups and all that good stuff.

Thank you Steve.
We needed that quick sanity check.

Actually, it would not surprise me if awk just did a linear search for its array keys, which would eat CPU.
As best I know, Perl builds an index tree, but that may be wishful thinking. I have never needed to find out. But some day...

Hein.