Re: How to improve PIPE performance

Peter T Jackson · ‎01-02-2004

I have a 700,000-block text file, zipped to 90,000 blocks.
Unzipping takes 11 minutes.
Searching takes 6 minutes.
Using PIPE to send the output of the unzip directly to the zip and avoid using a temporary file takes 50 minutes.

Is there any way to reduce the overhead of using PIPE?

Craig A Berry · ‎01-02-2004

I assume you mean "directly to the search" rather than "directly to the zip"? In other words, you only care about the search results and not the file itself? In any case I'm not sure why creating a temporary file is considered a bad thing, and clearly it's faster in your case.

I don't know of any way you can influence PIPE performance directly, though you might want to look at

$ help pipe description Improving_Subprocess_Performance

I doubt the recommendations there will make much difference in your case since I suspect the overhead comes from interprocess communication rather than from the initial spawning of a subprocess.

If you have a *lot* of memory, you could try unzipping to a RAM disk and doing the search there.

If the situation merits custom programming, you could start with the unzip sources and write something that searches unzipped output on the fly.

Martin P.J. Zinser · ‎01-02-2004

Hello,

assuming you are at 7.3-2 you might want to look at

http://h71000.www7.hp.com/doc/732FINAL/5763/5763pro_046.html

Although this discusses the C-RTL pipe function, the DCL pipe is most probably implemented in C, so setting the logicals might affect it too.

Defining DECC$PIPE_BUFFER_SIZE to 65535 might not be a bad value.

And yes, I know there is at least one person here in the forum who is much more competent to comment on this ;-)

Greetings, Martin

Craig A Berry · ‎01-04-2004

Martin,

Unless this changed recently, DCL's PIPE command uses the undocumented pipe driver and bears no relation to the CRTL's pipe() function. Both do interprocess communication, but I don't know if there's any way to adjust buffer sizes for the pipe driver.

Craig A Berry · ‎01-04-2004

For a comparison of the pipe driver with the CRTL pipe from the guy who wrote the pipe driver (MPA0:), see

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=3B30F1EA.17FB4BAE%40compaq.com.doom

Peter T Jackson · ‎01-04-2004

Thanks for the answers.

Yes I meant "directly to search".
The files can be much larger than the one I used for my testing. The largest I have seen was over 11 Gbytes unzipped. Temporary files that large are a problem.
The system that generated that large a file had lots of memory but it was in use.

The procedure is written in DCL so that it can be easily distributed and so that security conscious customers can check that it is not a risk.
Often it will be run on the text file directly. The option to handle zipped files is there to avoid having to find the disk space to unzip them after receiving the zipped version over the internet, and so to reduce the size of the collection of files I use for testing. PIPE is simpler than trying to automate the handling of a temporary file.

I looked at the suggestions in help and played around with set RMS before asking here.

Pete

Brad McCusker · ‎01-05-2004

As others have said, DCL pipe is not implemented via C RTL pipe().

To help complete the circle of information in this string, the C RTL pipe implementation is based on mailboxes and the logicals DECC$PIPE_BUFFER_SIZE and DECC$PIPE_BUFFER_QUOTA directly translate to corresponding parameters in the crembx call.

Brad

Brad McCusker
Software Concepts International

Craig A Berry · ‎01-05-2004

I think the only options available to you, Peter, work against your requirement that it be something available out of the box in DCL.

If that requirement can be lifted, there are all sorts of PC utilities that search within zip archives, and it might be possible to find an open source one and port it.

I also found a Perl-based solution. There's no way to know whether it's faster than PIPE without trying it in your environment, but it might be worth a look. If you want to try it you'll have to have Perl and install the following extensions:

Compress::Zlib
Archive::Zip

These can be obtained from http://search.cpan.org. They have some problems building on VMS, but it can be done. I can probably walk you through it if you're interested.

Archive::Zip comes with a sample script that does exactly what you want, i.e., searches the contents of a zip archive. The following example does a case-insensitive search for "fall" in a zipped version of SYS$MANAGER:SYLOGIN.TEMPLATE.

$ zip archive.zip sys$manager:sylogin.template
adding: [.SYSMGR]SYLOGIN.TEMPLATE (deflated 61%)
$ perl zipGrep.pl "(?i:fall)" archive.zip
sysmgr/sylogin.template:$! process logins. Each section falls through into the next section,
sysmgr/sylogin.template:$! the "Batch" section. (Note that all "Interactive" users will "fall
sysmgr/sylogin.template:$! then fall through again into the other sections.)
sysmgr/sylogin.template:$! Fall through...
sysmgr/sylogin.template:$! Fall through...
sysmgr/sylogin.template:$! Fall through...
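
For readers without Perl, the same idea can be sketched in Python using only the standard library's zipfile module. This is an illustration of the approach, not the zipGrep.pl script itself: the function name zip_grep and the sample file contents are made up for the example, and it has not been tried on VMS.

```python
import io
import re
import zipfile


def zip_grep(archive, pattern):
    """Yield (member_name, line) for every line in the archive matching pattern."""
    regex = re.compile(pattern)
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            # zf.open() decompresses incrementally as lines are read,
            # so no temporary copy of the member is written to disk.
            with zf.open(name) as member:
                for raw in member:
                    line = raw.decode("latin-1").rstrip("\r\n")
                    if regex.search(line):
                        yield name, line


if __name__ == "__main__":
    # Build a tiny in-memory archive so the sketch runs on its own
    # (hypothetical contents, loosely modelled on the example above).
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("sample.txt", "$! Fall through...\n$ exit\n$! then fall through again\n")
    buf.seek(0)
    for name, line in zip_grep(buf, r"(?i:fall)"):
        print(f"{name}:{line}")
```

Like the Perl script, this reads the archive member directly and involves no pipe between two processes.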

Martin P.J. Zinser · ‎01-05-2004

Hello Craig,

which opens up the question: Do we do our own pipe() in Perl for recent versions or is it the C-RTL one if and when available ;-)

Happy new year,

Martin

Craig A Berry · ‎01-05-2004

Just to be clear, for Peter's sake, Martin's question is a tangent. I suggested Archive::Zip because it doesn't use pipes at all, not DCL's, not the CRTL's, and not Perl's home-grown ones.

It's an interesting tangent, though. Perl since 5.6 uses a homegrown pipe implementation because the one in the CRTL is so prone to hangs and deadlocks. I believe this was still true as of v7.3-1, but I should probably give it a whirl again. I have a goal of making this configurable so you can choose which pipe implementation you want, but I haven't done it yet.

John Gillings · ‎01-05-2004

Craig,

Depending on the complexity of your SEARCH requirements...

Since UNZIP is available in source form, you could create a version that does the search directly on the output. "Simple" matter of finding the final output stage and implementing a filter on the data stream.

That will save you a process creation and two lots of I/O processing (out of UNZIP and into SEARCH). Furthermore, since you're potentially avoiding writing any of the uninteresting text, you may reduce the overall time below your 11 minutes for the unzip alone.
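
As a rough sketch of what "a filter on the data stream" could look like, here is the idea in Python rather than in the UNZIP C sources. The names inflate_chunks and matching_lines are hypothetical, and zlib merely stands in for UNZIP's final output stage. The point is only that matching can be done chunk by chunk, carrying a partial line across chunk boundaries, so non-matching text is never written anywhere.

```python
import zlib


def inflate_chunks(compressed, chunk_size=4096):
    """Decompress a zlib stream piece by piece, yielding output chunks."""
    inflater = zlib.decompressobj()
    for start in range(0, len(compressed), chunk_size):
        out = inflater.decompress(compressed[start:start + chunk_size])
        if out:
            yield out
    tail = inflater.flush()
    if tail:
        yield tail


def matching_lines(chunks, needle):
    """Yield only the lines containing needle from an iterable of byte chunks."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        # Everything before the last newline is complete; keep the
        # trailing partial line until the next chunk arrives.
        *lines, pending = pending.split(b"\n")
        for line in lines:
            if needle in line:
                yield line
    # The stream may end without a final newline.
    if pending and needle in pending:
        yield pending


if __name__ == "__main__":
    text = b"alpha\nbeta fall\ngamma\nfall"
    for line in matching_lines(inflate_chunks(zlib.compress(text), 7), b"fall"):
        print(line.decode())
```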

If you don't want to modify UNZIP, another option is to create your own mailbox with whatever characteristics you think will help performance. Something like this:

program makemailbox
inputs: lognam,bufquo,maxmsg
$crembx(,chan,maxmsg,bufquo,,,lognam,,)
$hiber
end program

(may be simpler just to hard code the mailbox attributes)

$ SPAWN/NOWAIT makemailbox mymailbox max quo
$ SPAWN/NOWAIT SEARCH mymailbox whatever
$ PIPE UNZIP -c file > mymailbox

Notes:
The "PIPE >" command was the easiest way I could find to redirect UNZIP output into the mailbox. It does not involve subprocess creation or use a pipe, so it's unlikely to have any negative performance effect.

The "makemailbox" subprocess needs to stay around until one of the other processes has opened a channel to the mailbox. It will remain indefinitely until killed or woken (SYS$WAKE).

The SEARCH subprocess will terminate at EOF of the UNZIPped stream. Issuing a second UNZIP against the same mailbox will put the main process into RWMBX because there is no reader.

No idea if this mechanism will be an improvement, try it and see!

A crucible of informative mistakes

Hein van den Heuvel · ‎01-05-2004

Think IOs. Think buffered IOs specifically.
Change your search to search/stat and compare.

Anything involving a pipe/mailbox will have search do an IO (kernel mode routine) for each and every record. Synchronously. Hopelessly slow.

When reading from a file, search will use RMS with large buffers, and RMS will use asynchronous read-ahead from the disk. Depending on the file contents you'll see about 100 records per (direct) IO and no buffered IO other than the (status) output.
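
To put rough numbers on this, the arithmetic can be sketched as below. The 700,000-block size and the figure of about 100 records per direct IO come from this thread; the 80-byte average record length is purely an assumption, so treat the results as order-of-magnitude only.

```python
BLOCK_BYTES = 512             # an OpenVMS disk block is 512 bytes
FILE_BLOCKS = 700_000         # unzipped size quoted at the top of the thread
AVG_RECORD_BYTES = 80         # ASSUMPTION: typical text record length
RECORDS_PER_DIRECT_IO = 100   # rough figure quoted in this post


def io_counts(file_blocks=FILE_BLOCKS,
              avg_record_bytes=AVG_RECORD_BYTES,
              records_per_direct_io=RECORDS_PER_DIRECT_IO):
    """Return (records, ios_via_mailbox, ios_via_file) for the given sizes."""
    records = (file_blocks * BLOCK_BYTES) // avg_record_bytes
    # Through a pipe/mailbox: one buffered IO per record.
    ios_via_mailbox = records
    # From a file: RMS fetches many records per direct IO (rounded up).
    ios_via_file = -(-records // records_per_direct_io)
    return records, ios_via_mailbox, ios_via_file


if __name__ == "__main__":
    records, via_mailbox, via_file = io_counts()
    print(f"records:           {records:,}")
    print(f"IOs via mailbox:   {via_mailbox:,}")
    print(f"IOs via disk file: {via_file:,}")
```

Under these assumptions the mailbox path issues roughly a hundred times as many IO requests as the file path, each of them synchronous.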

Accept the intermediate file, or adapt unzip to search for you as John suggests. Could put the temp file on ram disk?

hth,
Hein.

Peter T Jackson · ‎01-07-2004

Again thanks.

Unfortunately the advantages of using DCL are too great.
The point of using the PIPE is to avoid the need for a large temporary file.
The procedure can work on the text files directly, so the unzip can be done manually when the disk space is available.
Most users will not use the procedure on zip files often.
The only exception is myself when running my test suite.

A RAM disk would not be useful.
On the systems where very large files are created the memory is in use.
The largest I have seen was over 10 Gbytes.

It sounds like a pipe driver that has the option of doing buffered IO is the only way to improve this, but my procedure alone is not enough to justify writing one.
