how to parse this file - shell script

amonamon · ‎06-10-2007

Hello

I have a file with these values:

search|989987|17315766859|772812962255||1|0|10-06-2007 14:24:38|10-06-2007 14:24:38
act|989987|17315766859|772812962255||||10-06-2007 14:24:38|10-06-2007 14:24:38

I sorted the file on the 2nd column.
Normally I would have one line starting with search and a second one starting with act.

But how can I find lines in the file that have, for example, variations in the 4th column:

search|989987|17315766859|172812962255||1|0|10-06-2007 14:24:38|10-06-2007 14:24:38
search|989987|17315766859|722812962255||1|0|10-06-2007 14:24:38|10-06-2007 14:24:38
search|989987|17315766859|773812962255||1|0|10-06-2007 14:24:38|10-06-2007 14:24:38

thanks in advance

Hein van den Heuvel · ‎06-11-2007

amonamon,

Please try your question again. It is not clear enough, and the example input is a little too short.

My knee-jerk reaction is to say sort it on column 2 and column 4 and push it through 'uniq'.

>> I sorted file with 2. column

Does that mean on the second column, starting with 1?

>> With normal situation I could have one line starting with search and second with act

Is the file also sorted by column 1 ?
That is, are all 'search' entries together?

How much data is there? megabytes or gigabytes?

Here is something which might work:

awk -F"|" '{if (last != $2) {for (x in a) delete a[x]; last=$2} if ($1 in a) { print } else {a[$1]=1}}' x.txt

It keeps an array 'a' of every first-word value seen so far among the lines sharing the current column 2 value, and it tracks that value in the variable 'last'. If the current column 2 differs from 'last', it deletes all remembered elements from 'a'. If it sees a line whose first word already has an element in 'a', it prints the line; otherwise it remembers this one.

If your awk supports deleting a whole array with 'delete array', this simplifies a little to:

$ awk -F"|" '{if (last != $2) {delete a; last=$2} if ($1 in a) { print } else {a[$1]=1}}' x.txt
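For an awk that lacks whole-array 'delete', emptying the array with split("", a) is a commonly used alternative. A minimal sketch under that assumption, run on made-up input; the names data, seen and repeats are only for illustration and are not part of the original one-liner:

```shell
# Hypothetical sample: two "search" lines share column 2 (989987),
# so the second one is a repeat of an already-seen first word.
data='search|989987|17315766859|172812962255
search|989987|17315766859|722812962255
act|989987|17315766859|722812962255
search|289987|22215766859|172812962211'

repeats=$(printf '%s\n' "$data" | awk -F'|' '
    $2 != last { split("", seen); last = $2 }   # new column 2 value: empty the array
    $1 in seen { print; next }                   # first word already seen for this column 2
    { seen[$1] = 1 }')                           # otherwise remember it

printf '%s\n' "$repeats"
```

On this sample only the second 989987 "search" line is printed, which is the same behaviour as the two versions above.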

hth,
Hein

amonamon · ‎06-11-2007

You are right... here is a better explanation.

I have around 5000 lines.

search|989987|17315766859|772812962255||1|0|10-06-2007 14:24:38|10-06-2007 14:24:38
act|989987|17315766859|772812962255||||10-06-2007 14:24:38|10-06-2007 14:24:38
search|289987|22215766859|172812962211||1|0|10-06-2007 14:23:38|10-06-2007 14:24:38
act|289987|22215766859|172812962211||||10-06-2007 14:23:38|10-06-2007 14:24:38
search|089987|07315766859|322812962200||1|0|10-06-2007 14:24:38|10-06-2007 12:24:38
act|089987|07315766859|322812962200||||10-06-2007 14:24:38|10-06-2007 12:24:38
..
...
I want to see the "search" lines in this file that do not have a matching "act" line.

For example, the file might contain:

search|089987|07315766859|902812962660||1|0|10-06-2007 14:24:38|10-06-2007 11:04:39
search|089987|07315766859|902444962660||1|0|10-06-2007 14:24:38|10-06-2007 10:24:38

Normally, if I sort this file by the $4 field, I should get

search
act
search
act

To make sure that one pair (search+act) is OK, it has to have the SAME $2 and $4 fields.

Hope this is better..:(

amonamon · ‎06-11-2007

***

To make sure that one pair (search+act) is OK, it has to have the SAME $2 and $4 fields.

Hein van den Heuvel · ‎06-11-2007

I'm still not entirely sure, but you may well have a solution expanding on the awk script below:
------ nomatch.awk -----
{ if ($2 != l2 && $4 != l4) {
if (!a) { print s };
if (!s) { print a };
l2 = $2;
l4 = $4;
a = 0;
s = 0;
}
}
/^search/{ s = $0 }
/^act/{ a = $0 }
END {
if (!a) { print };
if (!s) { print };
}
--------------------------
use as:

awk -F"|" -f nomatch.awk

One only knows that there was no matching 'search' or 'act' when it is too late, that is, when one sees a new $2, $4 combination. Correct?
So the script remembers each line with search in s and each line with act in a.

It remembers the last $2 in l2 and the last $4 in l4.
If a line has a new $2 and $4, then there should be an s and an a picked up. Report if not! When a new $2 + $4 shows up, clear the current s and a.

Close?

Good luck,

Hein.

amonamon · ‎06-11-2007

Well, if I first sort the file by $2 and then by $4, I can easily remove all those
search - act lines from the file; they are not what I want.

I just want those search lines that do not have their act partner.

I tried your script and I just get:

0
0
0
0
0
as output, but I want the output to be the search lines that have no act lines (it is then clear that every act has to have its search).

Sandman! · ‎06-11-2007

Try the awk construct below. It will print all "search" lines that are missing their "act" counterparts and vice-versa. The lines are joined on the fields which are common i.e. $2 and $4.

# awk -F\| '{x[$2$4]++;l[$2$4]=$0}END{for(i in x) if(x[i]==1) print l[i]}' infile
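One caveat about the key: $2$4 glues the two fields together with nothing in between, so in principle two different pairs could collapse into the same string (for example 12 + 345 and 123 + 45). A minimal sketch of the same counting idea with the separator kept inside the key; the sample data and the names data, n, line and unmatched are made up for illustration:

```shell
# Hypothetical sample: key 989987/772812962255 is a complete pair,
# key 089987/902812962660 has a search line but no act line.
data='search|989987|17315766859|772812962255||1|0|10-06-2007 14:24:38|10-06-2007 14:24:38
act|989987|17315766859|772812962255||||10-06-2007 14:24:38|10-06-2007 14:24:38
search|089987|07315766859|902812962660||1|0|10-06-2007 14:24:38|10-06-2007 11:04:39'

unmatched=$(printf '%s\n' "$data" | awk -F'|' '
    { k = $2 FS $4; n[k]++; line[k] = $0 }   # count lines per key, keep the last line seen
    END { for (k in n) if (n[k] == 1) print line[k] }')

printf '%s\n' "$unmatched"
```

With this input only the 089987 "search" line comes out, since it is the only key seen exactly once.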

~hope it helps

Hein van den Heuvel · ‎06-11-2007

Please re-try my script on the EXACT information you showed. It really works for me. With those 6 sample lines it shows nothing. Delete an 'act' line (or search line) and it will show the 'search' line without that 'act' (or the act line missing a search).

The script would print '0' if there were empty lines in the data feed. You did not indicate that it should be ready to deal with those. Simple adjustment... if one knows the input data.

For debugging I would put an extra print line after "{ if ($2 != l2 && $4 != l4) {"

print NR,l2,$2,l4,$4,$1

And if it still does not work, attach a more significant (failing) dataset as a .TXT file to a further reply?

Admittedly there is a minor cut & paste error in my script, in the END section.
It should read
if (!a) { print s};
if (!s) { print a};
Of course this is just to deal with the last data lines in the file.

Hein.

Dennis Handly · ‎06-11-2007

Instead of using associative arrays, you can just sort on the column in question.

If you already know that column 2 is fine and that there is a "search" and "act" for every key, you can just work on column 4.

It seems you allow dups for column 2. So you need to have that many "act" for each "search"?

So to use sort then awk. The -k1r,1 would
sort "search" before "act".

$ sort -t"|" -k4,4 -k1r,1 file | awk -F"|" ...

You would check for a match on $4. If "search", increment. If "act", decrement.
When $4 no longer matches and the count isn't 0, you have a mismatch. If you need to make sure the other columns match, you would need to save the lines.

Is this what you wanted?

>To make sure that one pair (search+act) is OK, it has to have the SAME $2 and $4 fields.

Ok, it seems you want $2 and $4 as an extended key.
$ sort -t"|" -k2,2 -k4,4 -k1r,1 file | awk -F"|"

And match on the concatenated key:
key = $2 $4
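The sort-then-count idea could look roughly like the sketch below. It is only an illustration under assumptions, not the script attached later in this thread: the sample data, the variable names (data, unmatched, key, prev, count, saved), the use of LC_ALL=C for a predictable byte order, and the choice to report only leftover "search" lines are all mine.

```shell
# Hypothetical, unsorted sample: 089987/902812962660 has a search but no act.
data='act|989987|17315766859|772812962255||||10-06-2007 14:24:38|10-06-2007 14:24:38
search|089987|07315766859|902812962660||1|0|10-06-2007 14:24:38|10-06-2007 11:04:39
search|989987|17315766859|772812962255||1|0|10-06-2007 14:24:38|10-06-2007 14:24:38
search|289987|22215766859|172812962211||1|0|10-06-2007 14:23:38|10-06-2007 14:24:38
act|289987|22215766859|172812962211||||10-06-2007 14:23:38|10-06-2007 14:24:38'

unmatched=$(printf '%s\n' "$data" |
    LC_ALL=C sort -t'|' -k2,2 -k4,4 -k1r,1 |
    awk -F'|' '
        { key = $2 FS $4 }                        # extended key, separator kept
        NR > 1 && key != prev {                   # key changed: settle the previous group
            if (count > 0) print saved
            count = 0
        }
        { prev = key }
        $1 == "search" { count++; saved = $0 }    # remember the last search line
        $1 == "act"    { count-- }
        END { if (count > 0) print saved }')

printf '%s\n' "$unmatched"
```

Only one line of state is kept per group, so memory use stays flat however long the file is; the price is the extra sort pass.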

amonamon · ‎06-12-2007

It seems that only Sandman understood. I am sorry for my confusion. I am still testing and trying to rebuild the solutions that you presented.

>It will print all "search" lines that are missing their "act" counterparts and vice-versa. The lines are joined on the fields which are common i.e. $2 and $4.

That is what I want, but just the lines that are missing their "act" line, because every act MUST have its search in the file.

# awk -F\| '{x[$2$4]++;l[$2$4]=$0}END{for(i in x) if(x[i]==1) print l[i]}' infile

amonamon · ‎06-12-2007

Thanks everyone for a lot of help, but just one small thing.

Can you explain to me the meaning of:

-k1r,1 in the sort command? I am also confused by Sandman's awk, which works fine :(

Thanks a lot.

Dennis Handly · ‎06-12-2007

Ok, I've attached my test script. It uses sort and awk to simulate COBOL sort with an output procedure. :-)

As it reads possible duplicates, it prints out the last line of whichever kind there are too many of.

>but just lines that are missing their "act" line..because every act MUST have in file its search

I check both ways. You can delete the dead code if you want.

>Can you tell me the meaning of:
-k1r,1 in sort command

Sort on the first field, in descending (r for reverse) order.
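A tiny illustration of that key on made-up three-field lines (the data and the variable name sorted are hypothetical; LC_ALL=C is only there to pin the byte order): field 2 sorts ascending, and within equal field 2 the reversed field 1 puts "search" ahead of "act".

```shell
sorted=$(printf '%s\n' 'act|989987|x' 'search|989987|x' 'act|289987|y' 'search|289987|y' |
    LC_ALL=C sort -t'|' -k2,2 -k1r,1)

printf '%s\n' "$sorted"
# search|289987|y
# act|289987|y
# search|989987|x
# act|989987|x
```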

>I am confused with Sandman awk which works fine.

Well, for every concatenated key, it saves the line and maintains a count.

At the end, it prints each key that only had one.

My script allows you to have more than one "search" & "act" pairs for each key.

If you don't need that, more dead code.

Hein van den Heuvel · ‎06-12-2007

amonamon,

Still confused with:
# awk -F\| '{x[$2$4]++;l[$2$4]=$0}END{for(i in x) if(x[i]==1) print l[i]}' infile

Just break it up...

# awk -F\| '       # split on |, not whitespace
{                  # for every record
  x[$2$4]++;       # using field2 concatenated with field4 as index, increment an array element in array x, creating it if needed
  l[$2$4]=$0       # remember the (last) line with the same index in array l
}
END {              # after the last record
  for (i in x)     # walk the array x picking up index values in variable i
    if (x[i]==1)   # if the incremented value for that array element is exactly 1
      print l[i]   # then print the remembered line
}' infile

So... it stores about 1/2 the lines in memory.
And... if the input ever looked like 'search,act,test' it would also print.
And if the input looked like 'search,search' for the same $2 $4 then it would not print.
And it would be happy with 'search a,search b,act a, act b'

It would be a little better to use

if (x[i]!=2) instead of ==1

It would be more robust if there was a separate array 'a' to count 'act' lines and 'x' just counted 'search' lines.
You could then test for x being exactly 1 and make sure the corresponding act is also 1.

The right solution all depends on the errors you are expecting/tolerating. Sandman's line may well be good enough! (But mine is perfect, as it does not consume memory and would find all issues above :-) :-) :-)
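The separate-counter idea could be sketched like this. It is an illustration only: the array names s, a and line, the sample data, and the decision to print the last "search" line of any suspicious key are assumptions, not something taken from the scripts in this thread.

```shell
# Hypothetical sample: the first key is a clean pair, the second key has
# two search lines and no act, which a plain "count == 1" test would miss.
data='search|989987|17315766859|772812962255||1|0|10-06-2007 14:24:38|10-06-2007 14:24:38
act|989987|17315766859|772812962255||||10-06-2007 14:24:38|10-06-2007 14:24:38
search|089987|07315766859|902812962660||1|0|10-06-2007 14:24:38|10-06-2007 11:04:39
search|089987|07315766859|902812962660||1|0|10-06-2007 14:24:38|10-06-2007 11:05:02'

suspect=$(printf '%s\n' "$data" | awk -F'|' '
    { k = $2 FS $4 }
    $1 == "search" { s[k]++; line[k] = $0 }   # count search lines, keep the last one
    $1 == "act"    { a[k]++ }                 # count act lines separately
    END {
        for (k in s)
            if (s[k] != 1 || a[k] != 1)       # anything but exactly one of each
                print line[k]
    }')

printf '%s\n' "$suspect"
```

Here the second 089987 line is reported. Like the one-liner it builds on, this keeps one line per key in memory.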

~hope it helps

Hein.

amonamon · ‎06-13-2007

what to say..

Thanks a lot for your altruistic help!

amonamon · ‎06-13-2007

Me again. I just figured out that Hein's code covers every case. After I ran Sandman's awk, some of the "search" lines were not listed in the output although they DO NOT have their "act" partner, but after I execute

awk -F"|" '
{ if ($2 != l2 && $4 != l4) {
if (!a) { print s };
if (!s) { print a };
l2 = $2;
l4 = $4;
a = 0;
s = 0;
}
}
/^search/{ s = $0 }
/^act/{ a = $0 }
END {
if (!a) { print };
if (!s) { print };
} ' fileIN

I got those search lines. Please do not say that I am boring; I am really trying to learn here. But Hein, again I need your explanation of this code of yours :))

regards,

Hein van den Heuvel · ‎06-13-2007

That's ok... looks like it really was broken (a little)...

l4 # last known value for field 4
l2 # last known value for field 2
s # last input line with 'search' or 0
a # last input line with 'act' or 0
awk -F"|" '{ # for every line split by |
  if ($2 != l2 || # if field 2 changed OR
      $4 != l4) { # field 4 changed (this must be an OR, not the AND posted before)
    if (!a) { print s }; # print last search line if there is no a (still 0)
    if (!s) { print a }; # optional!
    l2 = $2; # remember this new key
    l4 = $4; # remember this new key
    a = 0; # not yet seen act line
    s = 0; # not yet seen search line
  }
}
/^search/{ s = $0 } # line starts with search? Remember it.
/^act/{ a = $0 } # ditto for act
END { # be nice and test after last record
  if (!a) { print s };
  if (!s) { print a };
} ' fileIN

I must have pasted an intermediate version with errors before. Sorry.

Also, my script will approve 'act,search', not just 'search,act'
Easy to fix... if that is your requirement by using: /^act/{if (s) a=$0}
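For reference, a guarded variant of the same streaming idea that reports only "search" lines without an "act" and prints nothing for the first record or for empty groups. This is a sketch, not the script above: it assumes the input is already ordered so that lines sharing $2 and $4 are adjacent, and the names data, lonely, key and prev plus the sample lines are made up.

```shell
# Hypothetical, already grouped sample: the middle key has no act line.
data='search|989987|17315766859|772812962255||1|0|10-06-2007 14:24:38|10-06-2007 14:24:38
act|989987|17315766859|772812962255||||10-06-2007 14:24:38|10-06-2007 14:24:38
search|089987|07315766859|902812962660||1|0|10-06-2007 14:24:38|10-06-2007 11:04:39
search|289987|22215766859|172812962211||1|0|10-06-2007 14:23:38|10-06-2007 14:24:38
act|289987|22215766859|172812962211||||10-06-2007 14:23:38|10-06-2007 14:24:38'

lonely=$(printf '%s\n' "$data" | awk -F'|' '
    { key = $2 FS $4 }
    key != prev {                               # field 2 or field 4 changed
        if (s != "" && a == "") print s         # previous group: search without act
        prev = key; s = ""; a = ""
    }
    $1 == "search" { s = $0 }
    $1 == "act"    { a = $0 }
    END { if (s != "" && a == "") print s }')   # settle the last group

printf '%s\n' "$lonely"
```

Using empty strings instead of 0 as the "nothing seen" marker is what avoids the stray 0 and blank lines in the output.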

Hein.

Sandman! · ‎06-13-2007

>Again me..I just figured out that Hein code covers every case some of "search"
>lines after sendman awk I executed are not listed in output althought they DO
>NOT have their act "partner"...

I am not sure what you mean by the above statement. Do you mean that the awk code I posted earlier does NOT print the stand-alone "search" lines? That is not a correct statement in my view. Although I posted the code without testing it, the awk code outputs the results fine after running it on your dataset. Yes, it does consume memory for storing all those array indices and the corresponding lines :( but as far as the results go, it displays unmatched "search" and "act" lines.

Another assumption is that there is a 1-to-1 ratio of "search/act" pairs, i.e. no duplicate "search/act" pairs are present in the input file. If that is not the case and duplicate "search/act" pairs are indeed allowed, please clarify so that the code can be tweaked for that variation.

Moreover, since you are only interested in "search" lines, here is a better version (albeit not perfect :) of the awk construct I posted earlier.

awk -F\| '
{
i=$2$4
x[i]++
l[i]=$0
f[i]=$1
} END {
for(i in x)
if(x[i]==1 && f[i]=="search")
print l[i]
}' infile

~hope it helps
