Operating System - HP-UX

Any compression tool available which could use multiple CPU

 
SOLVED
Madhu Kangara
Frequent Advisor

Any compression tool available which could use multiple CPU

I have a requirement to transfer a 35GB data file every day and am thinking of compressing it so that I could save transmission time. Currently the time I save in transmission is nullified by the time it takes to do the compression. I have multi-CPU servers available and am thinking of using threading to overcome the compression time.

I found pbzip2 (parallel bzip2), which has multiple-CPU support. But it has a 2GB file limit.

Does anyone have any recommendations?

Thanks in advance
Madhu
13 REPLIES
Fred Ruffet
Honored Contributor

Re: Any compression tool available which could use multiple CPU

I believe bzip2 wouldn't be a good choice anyway: it compresses really well, but it is rather slow.
Maybe you should use compress instead. It compresses less, but faster.
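
One way to settle that trade-off for your particular data is to time the candidates on a sample chunk first. A rough sketch (the file names and the 1GB sample size are just placeholders):

# carve a 1GB sample off the real file and time each compressor on it
dd if=/path/to/bigfile.dat of=/tmp/sample.dat bs=1024k count=1024
time compress -c /tmp/sample.dat > /tmp/sample.dat.Z
time gzip -c /tmp/sample.dat > /tmp/sample.dat.gz
time bzip2 -c /tmp/sample.dat > /tmp/sample.dat.bz2
ls -l /tmp/sample.dat*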

Regards,

Fred
--

"Reality is just a point of view." (P. K. D.)
Steven E. Protter
Exalted Contributor

Re: Any compression tool available which could use multiple CPU

None of these tools multi-threads or multi-processes, for a very good reason.

You can design your script to run two zip processes at the same time.


while true
do
    # count running zip processes; the [z] trick keeps grep from matching itself
    control=$(ps -ef | grep -c "[z]ipsomething")

    if [ "$control" -ge 2 ]
    then
        sleep 30
    else
        zipsomething &
        zipsomething &
    fi
done

That will at least ensure that two zip processes are running. You'll have to be careful building the zipsomething command line so the same file is not zipped by two processes.

As for the tool, gzip is good; it can go up to 8 GB with patching.
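
For instance, if the data has already been cut into parts (the part names below are hypothetical), a minimal sketch that hands the parts to two background gzip jobs at a time, so no part is ever compressed twice:

# assumes the data was pre-split into part.00, part.01, ... (made-up names)
i=0
for f in part.*
do
    gzip "$f" &                 # compress this part in the background
    i=$((i + 1))
    if [ $((i % 2)) -eq 0 ]
    then
        wait                    # never more than two gzips at once
    fi
done
wait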

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Madhu Kangara
Frequent Advisor

Re: Any compression tool available which could use multiple CPU

I liked the way pbzip2 worked across multiple CPUs, and I could specify how many CPUs to use.
Here is the link http://compression.ca/pbzip2/

But as I said earlier, it has a 2GB file size limit, while the version of bzip2 I use does not have that limit.

In my case I have a single 35GB file.
Fred Ruffet
Honored Contributor

Re: Any compression tool available which could use multiple CPU

SEP,

You're right for multiple files, but here there is only one file. So what is needed is really a multi-threaded tool, and that does not exist, as far as I know.

Regards,

Fred
--

"Reality is just a point of view." (P. K. D.)
Hein van den Heuvel
Honored Contributor

Re: Any compression tool available which could use multiple CPU

Here is an unfinished thought (because it is time to go to sleep!)....

'split' the large single file into fifo pipes.
Launch a compress for each pipe.
Start transfers as the compress jobs finish.
Uncompress, appending to a single file on the other side.
The uncompress, much like the transfer, would be a single stream, but that generally takes less time than the compress.
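
In plain shell the idea would look roughly like this (the names, chunk size and fifo count are made up, and it inherits the serialization problem discussed below, since split feeds the pipes one after another):

# create one fifo per chunk and hang a cat | gzip reader on each
for s in aa ab ac ad
do
    mknod xxx_$s p
    cat xxx_$s | gzip > xxx_$s.gz &
done
# split writes into the waiting fifos; the chunk size must match the fifo count
split -b 1024m bigfile xxx_
wait
rm -f xxx_a?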

Here is a perl script that was supposed to split in parallel:

$file = shift @ARGV or die "Please provide file to split and # chunks";
$chunks = shift @ARGV;
$chunks = 4 unless $chunks;
$chunks = 26 if $chunks > 26;
$total = -s $file;
die "puny file" unless ($total > 10000000);
$name = "xxx_";
$chunk = int( $total / $chunks);
# create one named pipe (fifo) per chunk: xxx_aa, xxx_ab, ...
$i = 0;
while ($i < $chunks) {
    $command = sprintf( "mknod %sa%c p", $name, ord("a") + $i++ );
    printf "-- $command\n";
    system ($command);
}
# first forked command is the split itself, then one cat | gzip per fifo
$command = "split -b $chunk $file $name";
$i = 0;
while ($i <= $chunks) {
    print "-- $command\n";
    exec ($command) unless fork();
    $letter = ord("a") + $i++;
    $command = sprintf( "cat %sa%c | gzip > %sa%c.gz", $name, $letter, $name, $letter );
}
# wait for all children to finish
$pid = 1;
$pid = wait() while ($pid > 0);


The first problem was that gzip does not eat from fifos... but cats do!
The biggest problem is that only one zip is going at a time, because split is of course only writing to one pipe at a time, waiting for the data to be picked up.
One silly fix for that is to split into real intermediate files, and zip those. Yuck.
I think the better solution would be for the perl script to fork multiple reader streams which each seek to their own start point and then read (binmode) and feed data into their own gzips.

On the other side, I think I'd go for a single unzip to combine the files. I don't think it will work to output into a single file from multiple streams after individual seeks. Then again, I suppose that would work, notably when starting block aligned.
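
For that seek-and-write variant on the receiving side, a rough shell sketch (the part names, the chunk size in 64k blocks, and the block-aligned assumption are all hypothetical, and the parts are assumed to sort in chunk order):

# write each decompressed part at its own offset in the output file;
# assumes every part except the last is exactly CHUNK_BLOCKS * 64k bytes
CHUNK_BLOCKS=16384              # 64k blocks per chunk, i.e. 1GB chunks (illustrative)
i=0
for part in bigfile_*.gz
do
    gunzip -c "$part" | dd of=bigfile bs=64k seek=$((i * CHUNK_BLOCKS)) conv=notrunc &
    i=$((i + 1))
done
wait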

Cheers,
Hein.
H.Merijn Brand (procura)
Honored Contributor

Re: Any compression tool available which could use multiple CPU

How can compression programs suffer from file size limits when used in pipes?

If you compress a single file, that limit is a burden, and for ages gzip had that problem. It was easy to overcome using more recent versions from GNU.

bzip2, just like gzip, has compression-rate command line parameters (-1 .. -9) that influence the CPU usage, but they do not control the number of CPUs involved, so your question is a very good one.

The option of having a script take care of running two (or more) compressions at the same time is good, but why would pbzip2 not work on unlimited file sizes when in streaming mode?

# pbzip2 -options < very_very_large_file > compressed_file

And why use an intermediate compressed file anyway? In streaming mode you could feed the compressed stream straight into the transfer,

# pbzip2 -options

using dd as a buffer.
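
A rough sketch of such a streaming transfer, with hypothetical host and file names (remsh is just one possible remote shell, and bzip2 -d on the far side should handle the concatenated streams pbzip2 writes):

# compress on the fly and decompress on the far side; no intermediate file
pbzip2 -options < very_very_large_file |
    remsh otherhost 'dd bs=64k | bzip2 -d > very_very_large_file'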

Another option would be to compile pbzip2 from source yourself, removing the file limit:
http://compression.ca/pbzip2/
http://compression.ca/pbzip2/pbzip2-0.8.tar.gz

Enjoy, Have FUN! H.Merijn
Steve Lewis
Honored Contributor

Re: Any compression tool available which could use multiple CPU

I have one; I wrote it myself and posted it to the original thread "favourite sysadmin scripts you always keep around" in August 2002, together with another program to decompress.
It is fixed at 4 threads, but that seems to be the optimum, even on my 12-CPU rp8400, where it consumes over 1000% (yes, one thousand) CPU in top.
It uses the zlib library, so you need that installed. It runs at between 2 and 3 times the speed of compress.
The decompression is single-stream, but that has been shown to be quickest.
The other thing is that it is also 32-bit (i.e. a 2GB limit where you do not redirect stdout), but as has been pointed out, if you use it to read a pipe and merely append or redirect stdout using |, > or >>, the 2GB limit does not apply.
Alternatively you can modify the fopen() call in the code to be fopen64() on the output file. The worst thing about it is that it isn't compatible with compress/gzip or bzip2.

Here is the link:
http://forums1.itrc.hp.com/service/forums/parseCurl.do?CURL=%2Fcm%2FQuestionAnswer%2F1%2C%2C0x026250011d20d6118ff40090279cd0f9%2C00.html&admit=716493758+1101807043917+28353475

Make sure you test it though, it isn't commercial and comes with no warranty(!)


Madhu Kangara
Frequent Advisor

Re: Any compression tool available which could use multiple CPU

In the meantime I received a patch for pbzip2 from its developer to fix the 2GB file size limit. I will test that and update the status here.

I liked some of the comments posted here and will give points for those.
Steven E. Protter
Exalted Contributor

Re: Any compression tool available which could use multiple CPU

I agree with you, Fred. I merely gave a multi-tasking methodology. None of these tools multi-threads, and that's for a very good reason.

Reliability is more important than speed.

Nice hat Fred.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Hein van den Heuvel
Honored Contributor
Solution

Re: Any compression tool available which could use multiple CPU

As replied earlier... you may want to check out solutions that keep all the bits in the air.

Anyway, over lunch I poked some more at a perl script to split a file and compress the parts, and it now works fine (after I moved the open + seek from the parent to the children).

Here is a sample session for a 4GB file on an ia64 HP server rx7620 (8p):

# time perl split.pl xx.dat 6
6 x 699072512 byte chunks. 10667 x 65536 byte blocks. 4194305024 bytes
real 1:35.60, user 0.64, sys 29.37
# That's with over 75% cpu busy and gives:
# ls -l xx*
4194305024 Nov 29 21:05 xx.dat
163707548 Nov 30 10:41 xx.dat_1.gz
167387395 Nov 30 10:41 xx.dat_2.gz
163581093 Nov 30 10:41 xx.dat_3.gz
162035968 Nov 30 10:41 xx.dat_4.gz
159506304 Nov 30 10:41 xx.dat_5.gz
159981309 Nov 30 10:41 xx.dat_6.gz
# Put them back together with:
for i in xx.dat*gz
do
gunzip -c $i >> xx
done
real 1:07.8, user 50.0, sys 16.3
# ls -l xx
4194305024 Nov 30 10:47 xx
# doublecheck
# time diff xx xx.dat
real 2:13.2, user 1:32.8, sys 24.4


The script:

$|=1;
$file = shift @ARGV or die "Please provide file to split and # chunks";
open (FILE, "<$file") or die "Error opening $file";
close (FILE);
$chunks = shift @ARGV;
$chunks = 4 unless $chunks;
$chunks = 26 if $chunks > 26;
$total = -s $file;
die "puny file" unless ($total > 10000000);
# make last chunk the smallest
$block = 64*1024;
$blocks = 1 + int( $total / ($chunks * $block));
$chunk = $blocks * $block;
print "$chunks x $chunk byte chunks. $blocks x $block byte blocks. $total bytes\n";
$i = 0;
while ($i < $chunks) {
    if ($pid=fork()) {
        $i++;
    } else {
        # each child opens its own handle, seeks to the start of its chunk,
        # and feeds its share of the blocks into its own gzip
        open (FILE, "<$file") or die "Error opening $file in child $i";
        binmode (FILE);
        $pos = sysseek (FILE, $chunk * $i++, 0);
        $name = "${file}_${i}.gz";
        open (ZIP, "| gzip > $name") or die "-- zip error child $i file $name";
        while ($blocks-- && $block) {
            $block = sysread(FILE, $buffer, $block);
            syswrite (ZIP, $buffer) if ($block);
        }
        exit 0;
    }
}
# parent waits for all the children
$pid = wait() while ($pid > 0);



Enjoy!
Hein.
harry d brown jr
Honored Contributor

Re: Any compression tool available which could use multiple CPU

Can you put a Gig-E circuit between the two servers? If so, then the transfer of 35GB would take about 5 minutes.
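
For reference, the arithmetic behind that estimate (assuming close to wire-speed throughput): 35 GB is about 280 gigabits, and at roughly 900 Mbit/s of usable Gig-E throughput that is 280,000 / 900, or about 310 seconds, a little over five minutes.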

live free or die
harry d brown jr
Live Free or Die
Steve Lewis
Honored Contributor

Re: Any compression tool available which could use multiple CPU

I love Hein's (2nd) perl script. Provided that the sysseek call really does go directly to the right point in the file without a sequential search, you should get good performance and a high level of confidence out of it.
My program is truly multi-threaded and does it all in one pass up the file, but you won't have the confidence that the simplicity of Hein's perl solution gives you. You would also have to edit some of my C code to get the correct level of compression and fopen64().
For a 35GB file, you need something that does only one pass up the source file, or several small scans of the parts.
Hein, I take my hat off to you, that little script is probably what I was looking for 2 years ago.

Hein van den Heuvel
Honored Contributor

Re: Any compression tool available which could use multiple CPU

> Provided that the sysseek call really does go directly to the right point in the file without a sequential search

It does. Each child starts and stops at pretty much the same time, with the actual data contents determining the CPU time needed.

> you should get good performance and a high level of confidence out of it.

With a reasonable I/O system I believe it gives a near inverse-linear reduction in elapsed time with the number of chunks selected, up to the number of available CPUs. For final performance tweaks you might want to toss an mpsched at the zip command and force one per CPU.
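
For example (the processor numbers are illustrative; check mpsched(1) for the exact options on your release), a standalone gzip could be bound to a processor like this, and the pipe opened in the script could be prefixed the same way:

# bind one gzip to processor 2, the next to processor 3, and so on
mpsched -c 2 gzip < some_chunk > some_chunk.gz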

> perl solution gives you. You would also have to edit some of my C code to get

I find that it actually looks more like a C program than a perl script :^)

> I take my hat off to you, that little script is probably what I was looking for 2 years ago.

And a pretty wizard's hat at that. Thanks! :-)

Obviously the script is still pretty rough: only initial error handling, remnants of its shady past (that '26' was for the split hack in the first attempt), and so on. But it should be a fine starting point for someone's specialized solution (different output selection, different zip params, automatically determining the (free) CPU count, ...).

By making the last chunk the smallest I could keep the loop control simple: just read a selected number of blocks, or until you could read no more (the last chunk).
Cheers,
Hein.