Script question - comm limits
02-09-2009 12:44 PM
I used "comm -23 file_a file_b" successfully with files containing around 4500 records.
However, the statement fails with our production files, which contain 1.5 million entries each.
Are there some published limitations for this process?
Thanks
RayB
02-09-2009 01:00 PM
Re: Script question - comm limits
02-09-2009 01:09 PM
Hmmm,
Does the man page really indicate a memory limit?
There is none mentioned in:
http://docs.hp.com/en/B2355-90689/comm.1.html
There is also no reason to expect a memory limitation. It just loops and compares.
The algorithm surely only ever has to look at one record from each file, as they have to be in order.
For each pair it can determine what to do: print in column 1, 2 or 3, and then read the next record from file 1, from file 2, or from both.
No reason to remember anything, really.
(Okay, maybe an extra line buffer to deal with duplicate input records.)
fwiw,
Hein (not online right now)
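The loop Hein describes can be sketched in a few lines of Python. This is only an illustration of the merge walk, not the actual comm source; the function name comm_walk is made up for the example, and Python's string comparison stands in for byte-order collation (the two agree for plain ASCII names like these).

```python
def comm_walk(file_a, file_b):
    """Yield (column, line) pairs the way a comm-style merge walk would.

    Hypothetical helper for illustration. Both inputs must already be
    sorted in the same order that "<" uses, otherwise the walk drifts.
    Only one pending line per input is held at any time.
    """
    it_a, it_b = iter(file_a), iter(file_b)
    a = next(it_a, None)
    b = next(it_b, None)
    while a is not None and b is not None:
        if a < b:
            yield 1, a              # only in the first file
            a = next(it_a, None)
        elif b < a:
            yield 2, b              # only in the second file
            b = next(it_b, None)
        else:
            yield 3, a              # present in both files
            a = next(it_a, None)
            b = next(it_b, None)
    # One input is exhausted: the rest of the other is unique to it.
    while a is not None:
        yield 1, a
        a = next(it_a, None)
    while b is not None:
        yield 2, b
        b = next(it_b, None)


first = ["12.doc", "123.doc", "14.doc"]
second = ["123.doc", "137.doc"]
print(list(comm_walk(first, second)))
# [(1, '12.doc'), (3, '123.doc'), (2, '137.doc'), (1, '14.doc')]
```

Keeping only the pairs tagged 1 corresponds to what "comm -23" would print. Memory use does not grow with file size, which is why a size limit is unlikely to be the cause.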
02-09-2009 01:31 PM
Have you run "sort" on both file1 and file2 before you ran the "comm" command on them?
02-09-2009 03:34 PM
If they are not sorted, you get unpredictable results. Since Raynald didn't explain what "failed" meant, that could be it.
02-10-2009 07:13 AM
For supplementary information, both files contain single-field entries of 2 to 7 digits followed by ".doc" (e.g. 23456.doc) and have been sorted using "sort -n".
The lists look like 12.doc, 14.doc, 123.doc, 137.doc, 1234.doc, etc.
"comm -23 file_a file_b" returns lines which exist in both files, which is the failure.
It works perfectly on smaller files.
LC_COLLATE and LANG variables are not set.
RayB
02-10-2009 07:59 AM
Correction:
Options 1, 2, or 3 suppress printing of the corresponding column.
Thus "comm -12" prints only the lines common to the two files, so you have to use
comm -12 file_a file_b
to get the lines present in both files.
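As a quick way to reason about the option digits, the small Python sketch below maps an option string to the columns that remain visible. The helper name visible_columns is invented for this illustration and only models the documented "digit suppresses that column" rule.

```python
def visible_columns(option):
    """Return the comm output columns left after suppression, e.g. "-23" -> [1]."""
    # Each digit 1, 2 or 3 in the option string hides the matching column.
    suppressed = {int(ch) for ch in option if ch in "123"}
    return [col for col in (1, 2, 3) if col not in suppressed]


print(visible_columns("-23"))  # [1]  lines only in the first file
print(visible_columns("-12"))  # [3]  lines common to both files
```

So "comm -23" is the right choice when the goal is the lines unique to file_a, and "comm -12" when the goal is the common lines.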
02-10-2009 02:02 PM
Solution
This is NOT sorted. comm(1) requires the files to be in collating order by character, i.e. sorted with no sort options (so not "sort -n"). This is the "proper" ASCII order:
12.doc 123.doc 1234.doc 137.doc 14.doc
>Hakki: correction:
You read it wrong. RayB was saying that since "comm -23" returned entries that exist in both files, it was broken.
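The difference between the two orderings, and how it can make common entries leak into column 1, can be reproduced with a small Python sketch. The helpers leading_number and only_in_first are hypothetical names for this example; Python's default string sort is used as a stand-in for plain ASCII collation, and the two-line inputs are made up to keep the case small.

```python
import re

names = ["12.doc", "14.doc", "123.doc", "137.doc", "1234.doc"]


def leading_number(name):
    """Integer prefix of the name, roughly what "sort -n" compares on."""
    return int(re.match(r"\d+", name).group())


def only_in_first(lines_a, lines_b):
    """Column 1 of a comm-style merge walk, roughly what "comm -23" prints."""
    result = []
    j = 0
    for a in lines_a:
        # Skip entries of the second list that compare lower than this line.
        while j < len(lines_b) and lines_b[j] < a:
            j += 1
        if j < len(lines_b) and lines_b[j] == a:
            j += 1              # common line, would go to column 3
        else:
            result.append(a)
    return result


print(sorted(names, key=leading_number))
# ['12.doc', '14.doc', '123.doc', '137.doc', '1234.doc']
print(sorted(names))
# ['12.doc', '123.doc', '1234.doc', '137.doc', '14.doc']

# Numerically ordered input: 123.doc is in both lists but is still reported.
print(only_in_first(["14.doc", "123.doc"], ["123.doc"]))   # ['14.doc', '123.doc']
# The same data in character order gives the expected answer.
print(only_in_first(["123.doc", "14.doc"], ["123.doc"]))   # ['14.doc']
```

In the numeric case the walk passes 123.doc in the second list before it reaches 123.doc in the first, so the match is never seen.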
02-10-2009 07:38 PM
I realize that it worked on smaller files, but I was thinking that the heuristic for the program would have a whole lot less to keep tabs on if it is sorted than if it is not. That is, what if line 1 matches line 1.5 million? The program may try to keep those two lines, and everything in between, in memory, and die. Whereas, if the sort was done beforehand, the start and end pointers in both files, which correlate to active memory locations, would/could be much closer together, and thus the working set possibly much smaller.
This would be somewhat analogous to the reason one uses the "-depth" option in find commands for large directory trees, so that you keep fewer file tree locations pointing to memory locations in an unresolved state. Basically, keeping fewer things to process at any one point in time.
So, I was just thinking that "it might be" that the memory error is just a case of the program somehow having too many "open issues" in process, and not enough closed ones, on the really large files. In fact, it may be that *this* is the only reason why it would like the data sorted up front: just so that the algorithm has fewer "balls in the air" at any one time, and therefore won't run out of working space. My guess is that this theory is probably correct, since you've indicated that the program works on small files even though, according to Dennis, you've probably performed the sorts wrong. That is, the program only needs the data sorted to reduce the amount of workspace it needs to resolve the issues in one pass; otherwise, it would probably have to make multiple passes over the files with some creative block swapping to approach a result with huge files.
Just thinking about how I would have written the comm program: if I had to write it, the first thing I'd want is a guarantee that the input is sorted in some standard way that I could count on in advance. Otherwise, I could envision that handling everything in memory at once could get out of control with large sets.
Try it alpha sorted as Dennis suggested and see. Maybe, maybe not - it's a possibility!
02-10-2009 09:07 PM
Didn't I say that? :-)
>I was thinking that the heuristic for the program would have a whole lot less to keep tabs on if it is sorted than if it is not.
If you don't want it sorted, you use diff(1).
The algorithm is very simple, as Hein says: you look at two records and write the one that compares lower, or the common line if they are equal.
>the algorithm has fewer "balls in the air" at any one time
Both balls are on the table in plain sight.
There are two FILE* buffers and two line buffers of LINE_MAX.
>Try it alpha sorted as Dennis suggested and see. Maybe, maybe not - it's a possibility!
("No! Try not. Do, or do not. There is no try." :-)
RayB gave the file content and they failed the "sort -c file" check.
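A check in the spirit of "sort -c" is easy to approximate in Python. The sketch below is an assumption-laden stand-in, not the real utility: the function name first_disorder is made up, and Python string comparison is used in place of the locale's collating order (equivalent for these ASCII names).

```python
def first_disorder(lines):
    """Return the 1-based number of the first out-of-order line, or None if sorted."""
    previous = None
    for number, line in enumerate(lines, start=1):
        # A line that compares lower than its predecessor breaks the order.
        if previous is not None and line < previous:
            return number
        previous = line
    return None


rayb_list = ["12.doc", "14.doc", "123.doc", "137.doc", "1234.doc"]
print(first_disorder(rayb_list))          # 3, because "123.doc" sorts before "14.doc"
print(first_disorder(sorted(rayb_list)))  # None
```

Running such a check on both inputs before calling comm catches a "sort -n" ordering early, instead of surfacing later as wrong output.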