Operating System - HP-UX
1748179 Members
4134 Online
108758 Solutions
New Discussion юеВ

Re: Sort in awk is very slow

 
SOLVED
Go to solution
Neil Edwards
Advisor

Sort in awk is very slow

Hello Gang,

I seem to have coded myself into a corner. I am working on a fairly complex set of financial reports written mainly in awk. The reports work but take a very long time to complete. I have found that my main problem is the sort. The sort example I use was taken from the O'reilly Book "Sed & Awk" on page 222. It is very compact and works well when the array is small but when processing about 5000 accounts can take several minutes to finish. My data is actually accumulated in a number of parallel arrays. I also have an array which looks like this:
202-456-4490 908
102-412-6512 134
000-000-0000 0
000-000-0001 1

The first field is the account number and the second field is the
index in the original arrays. I then sort this array and output the original parallel array values using the second field of this sorted array to access the data. The reports now print in account number order just as they should. My only problem is the speed of the sort. I am thinking about writing the data to a file and then calling the sort program but maybe someone has a better idea.

Thanks in advance, Neil
It wasn't me.
9 REPLIES 9
harry d brown jr
Honored Contributor

Re: Sort in awk is very slow


perl is an option, or even faster hardware.

live free or die
harry
Live Free or Die
harry d brown jr
Honored Contributor

Re: Sort in awk is very slow

Try converting your awk script to perl:

a2p

live free or die
harry
Live Free or Die
A. Clay Stephenson
Acclaimed Contributor
Solution

Re: Sort in awk is very slow

Hi Neil:

I had to pull out my copy of 'Sed & Awk' to see what you were talking about but as soon as I did I saw what your problem is. This is a classic 'bubble sort' and its performances degrades with the square of the number of elements.

Before I give you a good answer, I suppose I should point out that all that parallel array nonsense (excuse me - code) could have been avoided with the use of perl (specifically the Class:Struct module) which could have kept all your data related to one account in a nice box (a struct/record).

The good news is that in my awk code library, I have a sort function that will do the trick and should do in at most a few seconds what takes your bubblesort minutes. (It does assume that your arrays are numeric indexed rather than associative and the lowest index is 1).

use it like this:

qsort(my_array,1,count)
Arg 1 = the array to sort
Arg 2 = 1 (because this is a recursive function it needs an array lower bounds; should always be 1)
Arg 3 = number of elements in the array

This should fix you, Clay

If it ain't broke, I can fix that.
Neil Edwards
Advisor

Re: Sort in awk is very slow

Thanks for the quick responses!

I've tried to learn Perl but I seem to get bogged down everytime I try. Clay, I put in your sort function and sorts that were taking 15 minutes are now done in 1 or 2 seconds!!! I can only say wow!!! I am truly amazed because your sort is only a few lines longer than mine. My users are going to be impressed with 'my' work.

Thanks again, Neil
It wasn't me.
A. Clay Stephenson
Acclaimed Contributor

Re: Sort in awk is very slow

Hi again Neil:

I'm glad you liked the sort function. The fundamental problem with your sort is that the exchange interval (1) is so small that repeated passes are required to sort the data.
I actually have a better function that combines the best features of quicksort and bubblesort by switching to bubblesort when the array boundaries are small. Bubblesort excels in this environment. Since you are already in the 1-2 second range and seem happy, I don't think I will bother.
If it ain't broke, I can fix that.
Neil Edwards
Advisor

Re: Sort in awk is very slow

Hi Clay,

I just saw your last post. Would you mind posting your better sort function? I would really like to squeeze all the speed out of this program that I can.

TIA, Neil
It wasn't me.
A. Clay Stephenson
Acclaimed Contributor

Re: Sort in awk is very slow

Okay, but I suspect the improvements are going to be small. Use it like this:

bqsort(my_array,1,count,6)

The 4th argument is the 'bubblepoint' and whenever a quicksort partition size falls to this value, bubblesort is used instead.

The 'bubblepoint' should probably be kept in the range of 4 to 10 - 6 is an all-around good value for the vast majority of datasets.

Clay
If it ain't broke, I can fix that.
A. Clay Stephenson
Acclaimed Contributor

Re: Sort in awk is very slow

Oops ....

I'm am idiot. That should be:
qbsort(my_array,1,count,6)

Clay


If it ain't broke, I can fix that.
Neil Edwards
Advisor

Re: Sort in awk is very slow

Thanks again Clay. That speeded up the sort by maybe 10%.

Neil
It wasn't me.