1839761 Members
3033 Online
110154 Solutions
New Discussion

Re: Sort question.

 
SOLVED
Go to solution
Sean OB_1
Honored Contributor

Sort question.

This should be easy.

I have a web log file with format of:
66.196.65.25 - - [17/Sep/2002:09:00:57 -0500] "GET /robots.txt

I need to sort by date. So sort by month (Sep), day (17), time (09:00:57).

What's the easiest way to do so?

TIA,

Sean
15 REPLIES 15
James R. Ferguson
Acclaimed Contributor

Re: Sort question.

Hi Sean:

I'd use something like this:

# sort -t/ -k3 -kM2 -k1 filein

This uses the "/" as the field delimiter. The '-M' signifies the third field should be compared as months. See the 'sort' man pages for an explanation.

Regards!

...JRF...
Sean OB_1
Honored Contributor

Re: Sort question.

James,

Wouldn't that only sort on Month and then the first field, which starts with IP address?

I've read the man on sort and didn't see any way to either sort on a column, or specify more than one deliminator, say "/" and "[".

Sean
James R. Ferguson
Acclaimed Contributor

Re: Sort question.

Hi (again) Sean:

Sorry, the month is the second field:

Yes, the first field really consists of the IP address through the day. The second field is a 3-character month (cleanly), and the third field is a year followed by a time. Ugly.

It would be helpful to re-format the data before sorting and/or to add a field start/field end to the keys specified.

The "key" [no pun intended, here] is the '-M' option to handle sorting months.

Regards!

...JRF...
Sean OB_1
Honored Contributor

Re: Sort question.

Ok, so now I can do the month field, but how can I also sort on the day and time fields?

That's the part that I couldn't figure out. I can't sort on field one, as that will put it in ip address order, when I need only the day portion of it. Ditto for field 3, where I only need the time portion of it.

Sean
harry d brown jr
Honored Contributor

Re: Sort question.


Sean,

You should convert the date/time to the internal format and sort on that! The use of perl comes to mind here!

live free or die
harry
Live Free or Die
Sean OB_1
Honored Contributor

Re: Sort question.

Another caveat. The file needs to stay in the same format or the web log analyzer won't be able to parse it.

harry d brown jr
Honored Contributor

Re: Sort question.

Does gnu have a sort?

Of course you are going to require that a "set" of transactions stay together??

What you need is a real web log analizer.

live free or die
harry
Live Free or Die
Sean OB_1
Honored Contributor

Re: Sort question.

Actually I have a pretty good analyzer, but it doesn't seem to like it when the records come in out of order.

I had to combine log files from about 10 sites into one file, which means that the records are out of order.

So I only need to sort the file on the date fields.
harry d brown jr
Honored Contributor

Re: Sort question.


Sean,

With perl you could convert the "external date" into the internal date and sort on that date, which would just be an index to the original record.

live free or die
harry
Live Free or Die
Reinhard Burger
Frequent Advisor

Re: Sort question.

Hi Sean

I would try awk to reformat the file .
You can use the "blank" as Input Field Separator and as Output Filed separator the ";" { FS = " "; OFS = ";" }
This will deliver you as output :
66.196.65.25 - - [17/Sep/2002:09:00:57 -0500] "GET /robots.txt in the variables
$1 $2 $3 $4 $5 $6

$1 = 66.196.65.25
$2 = -
$3 = -
$4 = [17/Sep/2002:09:00:57
$5 = -0500]
$6 = "GET
$7 =

Next step is to split $4 into Date and time and store both values in variables

if the date is always in this format then its easy :
Date = substr( $4,2,11)
Bracket = substr ( $4,1,1)
Colon = substr ( $4,12,1 )
Time = substr ( $4,13,8)

outpuline is :

print $1,$2,$3,Bracket,Date,Colon,Time,$5,$6,$7

You get a line like this
66.196.65.25;-;-;[;17/Sep/2002;:;09:00:57;-0500;"GET;/robots.txt

Now you can sort on the fields you like.
Output this again into a file and run again awk to replace the ; by a
Now you should have a copy of your original file in the same format but sorted.

Following you will find the listing of both awk scripts needed :

script one used to create sortable fields

BEGIN { FS = " " ; OFS = ";" }


{
Date = substr( $4,2,11)
Bracket = substr ( $4,1,1)
Colon = substr ( $4,13,1 )
Time = substr ( $4,14,8)


print $1,$2,$3,Bracket,Date,Colon,Time,$5,$6,$7

}
#end of script

Script two used to remove ";" after sorting

BEGIN { FS = ";" ; OFS = " " }

{
print $1,$2,$3,$4 $5 $6 $7,$8,$9,$10
}
# $4 to $7 need to be placed WITHOUT a comma in between
# end of script


squence of commands :

awk -f <script one> logfile > logfilecopy1
sort
awk -f <script two> logfilecopy2 >logfilecopy2

forward file "logfilecopytwo to your weblog parser. Ithink this should work

Maybe it's not a nice way but it will do it. And there may also be shorter ways,. but as i'm not the big expert on the shell prompt i most often handle it in such sequences.
Hope it helps
keep it simple
john korterman
Honored Contributor

Re: Sort question.

Hi Sean,
the safest way is probably to read the time specification in each line and then convert month names into numerical values. You should then apply time specs. as a "numerical weight (year, month, day, etc)" at the end of each line in a temporary file.
Example: your input:
66.196.65.25 - - [17/Sep/2002:09:00:57 -0500] "GET /robots.txt

Temporary output:
66.196.65.25 - - [17/Sep/2002:09:00:57 -0500] "GET /robots.txt # 20020917090057

Then sort the lines of the temporary file using the "numerical weght" as key, but write out only until the distinctive character.
The attached script can (perhaps) create the temporary file.

regards,
John K.
it would be nice if you always got a second chance
Sean OB_1
Honored Contributor

Re: Sort question.

Thanks for the help guys, I'll try it today and see if I can get it to work.

Am I right that sort won't let you sort on a column? That would seem like the most basic requirement of a sort util.

For example sort on chars in columns 5-10.

James R. Ferguson
Acclaimed Contributor
Solution

Re: Sort question.

Hi Sean:

You can limit the sort key to character positions within the field. This is called a restricted sort key. For instance, if I had a file like:

a999bc
a123bc
a011bc
a099bc
a456bc

...I could sort on the first field using only positions (columns) two, three and four (the numeric part) with:

# sort sort -k1.2,1.4 filename

This is what I originally suggested when I said, "It would be helpful to re-format the data before sorting and/or to add a field start/field end to the keys specified".

Regards!

...JRF...
Rodney Hills
Honored Contributor

Re: Sort question.

Here is a one command line that should do it. It extracts the sort fields, reorders them, prefixes them on the line, does the sort, then removes the prefixed keys.

perl -ne 'if (/\[([^\]]+)\]) {@a=split("[:/]",$1);print $a[2]," ",$a[1]," ",@a[0,3,4,5]," ",$_;}' yourlog | sort -k1 -kM2 -k3 | cut -c19- >sortedlog

Hope this helps...

-- Rod Hills
There be dragons...
Sean OB_1
Honored Contributor

Re: Sort question.

Ok, I ended up using a kludge of what Jim suggested.

I used the [ as the delimiter and sorted on the second field. Fortunately for this file there was only 1 month so I didn't have to worry about getting the month order right. But for future times I'll have to deal with that.

Thanks for the help.

Sean