Operating System - HP-UX

Urgent Question about grepping thru the logs

SOLVED
Allanm
Super Advisor

Urgent Question about grepping thru the logs


I have a list of fraudulent IPs (~2000) that I need to search for in my Apache web logs. I have the logs (~450 files) from all my web servers in one place, covering the last 3 months. What would be the best way to grep for those IPs in the gzipped logs?

Please help!

Thanks,
Allan
14 REPLIES
Allanm
Super Advisor

Re: Urgent Question about grepping thru the logs

Please Help!
James R. Ferguson
Acclaimed Contributor
Solution

Re: Urgent Question about grepping thru the logs

Hi Allan:

You might create a file of your IP addresses -- one per line, called '/tmp/IPS' -- and then do:

#!/usr/bin/sh
cd /path_to_logs
for FILE in *
do
    echo ">>> '${FILE}' <<<"
    gzcat "${FILE}" | grep -f /tmp/IPS
done

Regards!

...JRF...
Allanm
Super Advisor

Re: Urgent Question about grepping thru the logs

That is working, but going very, very slowly.

Any way to speed it up?

Thanks,
Allan.
Michael Steele_2
Honored Contributor

Re: Urgent Question about grepping thru the logs

Hi

The gzcat is probably consuming all your CPU time by decompressing each file. If you've got the room, you can speed things up by storing the logs uncompressed and removing this step.
Support Fatherhood - Stop Family Law
Allanm
Super Advisor

Re: Urgent Question about grepping thru the logs

I am afraid space is at a premium with these huge log files. What I am planning is to split the IP file into 6 files and then run the script 6 times in parallel. Let me know if you have any more good ideas.

Thanks,
Allan
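[Editor's note: the split-and-run-in-parallel plan described above can be sketched roughly as below. The demo file names are invented stand-ins for the real /tmp/IPS list and log directory, and gzip -dc is used as the portable spelling of gzcat.]

```shell
# Sketch of the plan: demo files stand in for the real IP list and logs.
mkdir -p /tmp/demo_logs
printf '1.2.3.4\n2.3.4.5\n4.5.6.7\n9.9.9.9\n' > /tmp/demo_IPS
printf 'noot 1.2.3.4\naap 5.6.7.8\n' | gzip > /tmp/demo_logs/a.log.gz
printf 'mies 4.5.6.7\n' | gzip > /tmp/demo_logs/b.log.gz

# Split the IP list into chunks of 2 lines each, then scan all the logs
# once per chunk, one background job per chunk.
split -l 2 /tmp/demo_IPS /tmp/demo_IPS.part.
for PART in /tmp/demo_IPS.part.??
do
  ( for FILE in /tmp/demo_logs/*.gz
    do
      gzip -dc "${FILE}" | grep -f "${PART}"
    done > "${PART}.hits" ) &
done
wait
cat /tmp/demo_IPS.part.*.hits
```

With the real data, -l would be ~334 (2000 IPs over 6 chunks); each chunk still decompresses every log, so this trades extra gzcat work for parallel grep.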
Hein van den Heuvel
Honored Contributor

Re: Urgent Question about grepping thru the logs

I don't know how fast "grep -f" works, never having benchmarked it.
Given the number of lines, you may want to help it a little, if you can, by not having just the IPs there but ANCHORING them to the beginning of the line -- ^aa.bb.cc.dd -- to allow a quicker yea-nay decision.
(If appropriate... you did not share any log layout.)

As expressed earlier, it is not unlikely to be the gzcat which consumes more resources. You really should verify that (with top?).

If 'grep' is the top consumer, then consider rewriting in AWK or PERL: initially load those 2000 IPs into an associative array, then read the log, find the IP, and look it up in the array.

Something roughly like:
$ cat > IP.tmp
1.2.3.4
2.3.4.5
4.5.6.7
$ cat > LOG.tmp
aap 5.6.7.8
noot 1.2.3.4
mies 4.3.2.1
$ awk 'BEGIN {while (getline ip < "IP.tmp"){ips[ip]=1}} $2 in ips' LOG.tmp
noot 1.2.3.4
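[Editor's note: a hedged sketch of wiring that same awk lookup to a gzipped log. The demo file names are invented, gzip -dc stands in for gzcat, and the IP is assumed to be the second field, as in the example above.]

```shell
# Same associative-array lookup as above, fed from a gzipped log.
printf '1.2.3.4\n2.3.4.5\n4.5.6.7\n' > /tmp/demo_IP.tmp
printf 'aap 5.6.7.8\nnoot 1.2.3.4\nmies 4.3.2.1\n' | gzip > /tmp/demo_LOG.gz
gzip -dc /tmp/demo_LOG.gz |
  awk 'BEGIN {while (getline ip < "/tmp/demo_IP.tmp") ips[ip]=1} $2 in ips'
# prints: noot 1.2.3.4
```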

Good luck!
Hein van den Heuvel
HvdH Performance Consulting
Patrick Wallek
Honored Contributor

Re: Urgent Question about grepping thru the logs

>>run the script 6 times in parallel

How many processors are in your server? If you have fewer than 6, you may do more harm than good.

I would run one fewer script than the number of processors in the system (4 processors -- 3 scripts running).
Steven Schweda
Honored Contributor

Re: Urgent Question about grepping thru the logs

> If 'grep' is the top consumer than consider
> re-writting in AWK or PERL [...]

Sometimes it pays to write a real computer
program in a real, compiled programming
language. C, for example, is popular these
days. Or so I hear. (I think that it even
has arrays.)

Sorry, if this sounds too radical.
Hein van den Heuvel
Honored Contributor

Re: Urgent Question about grepping thru the logs

>> Sometimes it pays to write a real computer
program in a real, compiled programming
language.

:-)

Yes. And hashed lookups and all that good stuff.

Thank you Steve.
We needed that quick sanity check.

Actually, it would not surprise me if awk just does a linear search for array keys, which would suck (CPU).
Best I know, Perl builds a hash index, but that may be wishful thinking. I have never needed to find out. But some day...

Hein.

Allanm
Super Advisor

Re: Urgent Question about grepping thru the logs

Grep is taking most of the cpu cycles -

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6179 root 25 0 68140 7824 652 R 100 0.0 242:11.62 grep
6739 root 25 0 64828 4516 648 R 100 0.0 93:35.20 grep
6771 root 25 0 67736 7420 652 R 100 0.0 78:34.72 grep
6915 root 18 0 67864 7492 652 R 100 0.0 15:52.94 grep
6919 root 25 0 68136 7828 652 R 100 0.0 14:33.67 grep
6800 root 25 0 68140 7780 652 R 100 0.0 65:17.68 grep
6799 root 18 0 4116 484 324 S 0 0.0 0:04.15 zcat

If PERL or AWK can help in speeding this up, can someone specify how, so that gzcat and grep can be replaced with that?

Dennis Handly
Acclaimed Contributor

Re: Urgent Question about grepping thru the logs

>gzcat and grep can be replaced with that.

You can replace the grep but you'll still need the gzcat.

You can try fgrep so it doesn't need to do pattern matching. (Otherwise you would also have to quote the "." in your IPs.)

The grep source shows it does read the -f file into memory.
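[Editor's note: a small illustration of the fgrep point, with invented sample file names. With grep -F the '.' in each IP is a literal character, so look-alike lines stop matching.]

```shell
printf '1.2.3.4\n' > /tmp/demo_IP.list
printf 'noot 1.2.3.4\n1x2x3x4\n' > /tmp/demo_access.log
grep    -f /tmp/demo_IP.list /tmp/demo_access.log  # '.' is a wildcard: 2 matches
grep -F -f /tmp/demo_IP.list /tmp/demo_access.log  # fixed strings: 1 match
```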
Hein van den Heuvel
Honored Contributor

Re: Urgent Question about grepping thru the logs

>> If PERL or AWK can help in speeding this up, can someone specify how, so that gzcat and grep can be replaced with that?

My example was supposed to show that.
The assumption in that example was that you can readily find the IP address in the log file in a fixed 'word' or 'column'.
For the purpose of the example I used the second field/word, represented by the '$2 in ips'.

As I hinted already, you'll need to provide us with a representative snippet from your log to help you grab the IP out of it.

And do heed Dennis's advice (and mine) to carefully construct the grep search file as quick-failing regular expressions with as few wildcards as possible -- notably the '.' in the IP address.

For example, using my sample file and adding 2 records:

$ cat >> LOG.tmp
1.2.3.4 noot
1x2x3x4

now returns

$ grep -f IP.tmp LOG.tmp
noot 1.2.3.4
1.2.3.4 noot
1x2x3x4

The two new lines should probably NOT be found. And they probably will not be found in the real log, but in the meantime grep is wasting CPU cycles trying to match them!
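[Editor's note: one way to build such a quick-failing pattern file, sketched below. It assumes the client IP starts each log line, as in Apache's common log format; adjust the anchoring if your layout differs.]

```shell
# Escape the dots and anchor each IP to the start of the line.
printf '1.2.3.4\n' > /tmp/demo_IP.raw
sed -e 's/\./\\./g' -e 's/^/^/' /tmp/demo_IP.raw > /tmp/demo_IP.anchored
cat /tmp/demo_IP.anchored
# ^1\.2\.3\.4
printf '1.2.3.4 - - "GET /"\nnoot 1.2.3.4\n1x2x3x4\n' > /tmp/demo_LOG3
grep -f /tmp/demo_IP.anchored /tmp/demo_LOG3
# only the first line matches now
```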

Hein.


James R. Ferguson
Acclaimed Contributor

Re: Urgent Question about grepping thru the logs

Hi (again) Allan:

> Dennis: You can try fgrep so it doesn't need to do pattern matching. (Otherwise you would also have to quote the "." in your IPs.)

Yes, I realized that too after I posted before dinner last night. The quoting can be handled by Perl (below).

> Dennis: The grep source shows it does read the -f file into memory.

I have often wondered about that! I assumed that it would, for speed. Thanks very much for looking.

Anyway, here's a quick Perl script that might speed things up. The dot characters in the "token" file of IP addresses are escaped "automatically". The script stops analyzing the pattern list as soon as a match to a line in the file is found and moves along to the next line of the file.

# cat ./ngrep
#!/usr/bin/perl
use strict;
use warnings;

my @tokens;

sub loadtokens {
    my ($file) = @_;
    local $/;
    my $fh;
    open( $fh, '<', $file ) or die "Can't open '$file'\n";
    $_ = <$fh>;
    @tokens = split;
}

my $tokenfile = shift or die "Token file expected\n";
loadtokens $tokenfile;

while (@ARGV) {
    my $fh;
    my $file = shift;
    unless ( open( $fh, "gzcat -c $file|" ) ) {
        warn "Can't open '$file'\n";
        next;
    }
    while (<$fh>) {
      PATTERN:
        for my $pattern (@tokens) {
            if (m/\Q$pattern/) {
                print $file, ': ', $_;
                last PATTERN;
            }
        }
    }
    close $fh;
}
1;

...run the script like:

# ./ngrep tokenfile file1 file2 file3...

The "tokenfile" should contain your IP addresses to be matched, one per line. The list of files on the command line are your logs to be analyzed.

You might try a timing test with just a few logs the original way and then with this code. I haven't had time to benchmark this.

Regards!

...JRF...
Dennis Handly
Acclaimed Contributor

Re: Urgent Question about grepping thru the logs

>Steven: Sometimes it pays to write a real computer program in a real, compiled programming language. C, for example, is popular these days.

How about C++ using STL?

My strtok(3) loop can probably be optimized better.
Insert:
    if (len < min_len) min_len = len;
    if (len > max_len) max_len = len;
    const char *p = strdup(buf);
    result = IP_set.insert(p);

Search:
    const char *p = strtok(buf2, " \t[]!@#$%^&*()_-=+{}|\\;:'\",<>/?");
    while (p) {
        len = strlen(p);
        if (len >= min_len && len <= max_len) {
            if (IP_set.find(p) != IP_set.end()) {
                printf("Found %s, in: %s\n", p, buf);
            }
        }
        p = strtok(NULL, " \t[]!@#$%^&*()_-=+{}|\\;:'\",<>/?");
    }

Using some random large files:
$ a.out itrc_IP.data itrc_IP.scan /stand/vmunix /var/adm/wtmp*
Duplicate key: 15.00.00.00
Elements in the container: 2
Found 15.00.00.00, in: abc 15.00.00.00 def
Found 16.00.00.00, in: sam[16.00.00.00] 1.2.3.4
Found 15.00.00.00, in: 15.00.00.00 is bad
files scanned 5, lines scanned 364676
tokens scanned 544502, tokens looked up 397339

Changing to read from stdin would be easy, just don't close it.