Operating System - HP-UX

Re: Anyone ever seen the source code to tee ?

 
Steve Lewis
Honored Contributor

Anyone ever seen the source code to tee ?

Hi, I am using tee in conjunction with zcat, a named pipe, wc and a database load, all at once.

$ mkfifo named_pipe
$ zcat $COMPRESSED_FILE | tee named_pipe | wc -l &
$ dbload...(from the named pipe)...

It's reading from SCSI320 15krpm disks and loading into Informix on an xp12k. It's on a 16 way SD32 with tons of cpu and i/o capability.

The objective is to allow database loads direct from compressed files and do a separate count on rows in the uncompressed data at the same time. The wc output is compared with the number of rows loaded by dbload as an assurance. This method condenses 3 scans up the data files, some of which are several GB, into a single pass.

But: the i/o is low, 600 blocks/sec being read, zcat has low cpu (3%), the wc processes are running at about 35%, the data loads at about 50%, but the tee processes are taking 80% cpu.

Does tee read byte by byte or has it a modicum of buffering built in to the program?

Incidentally I am also finding that named pipes aren't as fast as using | . Does that make sense to people? Why?



8 REPLIES
Arunvijai_4
Honored Contributor

Re: Anyone ever seen the source code to tee ?

Hi Steve,

The tee program is known as a "pipe fitting." tee copies its standard input to its standard output and also duplicates it to the files named on the command line.

More information at http://www.tldp.org/LDP/abs/html/extmisc.html

-Arun
"A ship in the harbor is safe, but that is not what ships are built for"
RAC_1
Honored Contributor

Re: Anyone ever seen the source code to tee ?

I have not looked at the source code of the tee command, but tee does two things at once: it writes to standard output and, at the same time, writes to a file. In your case the file is a named pipe, which means tee can't write continuously. A pipe has a limit on how much can be written to it before the writer has to wait: the first chunk must be read out of the pipe before it accepts the next one. I think these limits (write chunk size, read size) are set at the kernel level and you cannot do much about them.

The nature of your command makes tee slower in your case. I am sure that completely unzipping to a file system, reading from there and loading into the database would be much faster.
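
The kernel-level limit mentioned above can at least be inspected from a program. The following is only a sketch, and it assumes the POSIX fpathconf() call with _PC_PIPE_BUF is available on the system; the helper name pipe_buf_for_fd is made up for illustration. Note that PIPE_BUF is the largest write the kernel guarantees to be atomic on a pipe, which is not the same thing as the total amount a pipe can hold before the writer blocks. The thread does not state either value for this machine.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the PIPE_BUF value that applies to the
 * pipe or FIFO open on fd. fpathconf() returns -1 if the value cannot
 * be determined. This is the atomic-write guarantee, not the pipe's
 * total capacity.
 */
long pipe_buf_for_fd(int fd)
{
    assert(fd >= 0);                      /* caller must pass an open descriptor */
    return fpathconf(fd, _PC_PIPE_BUF);
}
```

A caller would create a pipe with pipe(), or open the FIFO made by mkfifo, and pass one of its descriptors to the helper.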
There is no substitute to HARDWORK
Steve Lewis
Honored Contributor

Re: Anyone ever seen the source code to tee ?

The code works, just not quite as fast as I would like. For instance a straight uncompress of the data zooms through reading/writing 20,000-80,000 blocks per second.

Straight data loads from uncompressed data go at the same rate, but combining them with a named pipe and a tee slows it all down. It's actually a close call, i.e.

uncompress, then dbload, then wc -l involves 3 scans up 100GB of data, but that is only about 15% slower than running all 3 commands in parallel: a single 100GB scan through tee and into wc, but also through a named pipe.

Maybe the named pipe really is writing into the filesystem and reading it back out again - i.e. converting 3 reads into 2 reads and one write which might explain the performance. It appears at the moment that | goes through memory whereas named pipes actually cause the file to grow to 8kb in size and sar -b to go through the roof. True?

Patrice Le Guyader
Respected Contributor

Re: Anyone ever seen the source code to tee ?

Hi Steve,

I know that's not exactly what you're looking for, but in case it helps, I've attached the source code of tee from MINIX.

Have a look at http://www.minix.org/

Regards.
Pat
Good judgement comes with experience. Unfortunately, the experience usually comes from bad judgement.
RAC_1
Honored Contributor

Re: Anyone ever seen the source code to tee ?

Check the following document. It does not apply exactly to your situation, but it explains how tee works.

http://www1.itrc.hp.com/service/cki/docDisplay.do?docLocale=en_US&docId=200000081050133
There is no substitute to HARDWORK
Steve Lewis
Honored Contributor

Re: Anyone ever seen the source code to tee ?

Strange how 3 large scans up 100GB, which is too large to be cached, are nearly as fast as a single scan and two pipes reading off the back of it.

Thanks for the help so far. It looks like my bright idea is becoming less bright when tested. It does at least mean that I can save disk space, copy compressed stuff around and load direct into the DB without needing 200GB of free space to uncompress it beforehand, plus it saves me half an hour of waiting, so not all bad.
rick jones
Honored Contributor

Re: Anyone ever seen the source code to tee ?

You might find the output of a tusc trace against tee to be enlightening. I'm guessing that the tee binary is stripped, but you may still find something interesting if you profile it with Caliper.
there is no rest for the wicked yet the virtuous have no pillows
A. Clay Stephenson
Acclaimed Contributor

Re: Anyone ever seen the source code to tee ?

Actually tee should be less than 150 or so lines of code; it's quite easy to build one yourself. The implementations I've seen simply use getchar() to read stdin and putchar() to write to stdout. The other files are written by putc(). All of these are buffered, but the operations are one character at a time. It should be possible to build a larger-buffered implementation using fread()s and fwrite()s that is more geared to your type of i/o. If only I knew some C or C++, I could probably code some very tight code in a few minutes.
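
As a rough illustration of the larger-buffered approach described above, here is a minimal sketch of the copy loop. It is not the HP-UX tee source, which nobody in this thread has seen. It uses read() and write() on file descriptors rather than the fread()/fwrite() calls suggested above, so that stdio's own buffering stays out of the picture. The function names tee_copy and write_all, and the buffer size passed in by the caller, are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Write all len bytes to fd, retrying after short writes and EINTR.
 * Returns 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Copy everything from in_fd to each of the n_out descriptors in
 * out_fds, moving up to bufsize bytes per read() instead of one
 * character at a time. Returns the number of bytes read from in_fd,
 * or -1 on error. */
long long tee_copy(int in_fd, const int *out_fds, int n_out, size_t bufsize)
{
    assert(bufsize > 0 && n_out >= 0);

    char *buf = malloc(bufsize);
    if (buf == NULL)
        return -1;

    long long total = 0;
    for (;;) {
        ssize_t n = read(in_fd, buf, bufsize);
        if (n == 0)                 /* end of input */
            break;
        if (n < 0) {
            if (errno == EINTR)
                continue;
            free(buf);
            return -1;
        }
        for (int i = 0; i < n_out; i++) {
            if (write_all(out_fds[i], buf, (size_t)n) < 0) {
                free(buf);
                return -1;
            }
        }
        total += n;
    }
    free(buf);
    return total;
}
```

A main() built around this would put STDOUT_FILENO and a descriptor for each file named on the command line into out_fds, then call tee_copy(STDIN_FILENO, out_fds, n_out, 65536). The 64 KB figure is only a starting guess to tune against the real load, and the sketch does nothing special about SIGPIPE or a reader that goes away.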
If it ain't broke, I can fix that.