Re: Extracting data from a huge file

Umair_1 · ‎06-09-2006

I basically need to extract chunks of data between two strings from two files in Unix and then do some operation on those chunks. I have already written the attached piece of code in KSH, but it is very, very slow. The machines I am running on are super fast, so the problem lies in the code and the searching. Suggestions to either improve this or write something else that will increase the speed would be appreciated.

MY APPROACH: the data is in a huge file, with chunks separated between, let's say, these two strings, i.e. a chunk starting at " PUT" and ending at "PUT", e.g.,

!!!!!!!!!!!!!!!EXAMPLE DATA!!!!!!
PIN INPUT(
first blah blah( blah
blah
blah blah !@#$#
blah

blah blah blah blah
blah

*** **PUT
!!!!!!!!!!!!!111

MY CODE
=======
loop1

IFS=":"
##DATA FROM BVR###
fileone=$(cat first.file)
dataone=${fileone#*"$i"}
dataone=${dataone%%PUT*}
print $dataone > buf.first
##DATA FROM EVR###
filetwo=$(cat second.file)
datatwo=${filetwo#*"$i"}
datatwo=${datatwo%%PUT*}
print $datatwo > buf.second

loop2
some basic grep data manipulation on
the chunks from above
endloop2

end of loop1

A. Clay Stephenson · ‎06-09-2006

At the very least, use awk but a better, faster result will be had through Perl. I've often seen Perl perform as well as C especially for complex regular expression matching.

Awk is easier to learn, but a really sneaky method is to write your script in awk and then, when it is working as desired, use a2p to read your awk script and output an equivalent Perl script. Awk or Perl will probably be 50-100X faster than your current approach.

If it ain't broke, I can fix that.

Umair_1 · ‎06-09-2006

I agree. Just for reference, my code starts very fast and then slows down. I am concluding that this is because searching for data deep down in the file takes time; extracting a chunk from near the top, or even the bottom, did not take that long when I tested it.

In my case the pattern is very simple: all data AS IS between "PUT" and "PUT" from a file. I can write a Python/Perl script, but I last wrote them a year ago, so any simple examples would be appreciated and I can then write my code. Thanks for the input.
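
A minimal Python sketch of that kind of extraction, reading the input line by line so only one chunk is held in memory at a time. It assumes, going by the sample data in the first post, that a chunk opens on a line ending in `PUT(` and closes on the next line ending in `PUT`. The function name `extract_chunks` and both marker arguments are illustrative and do not come from the thread.

```python
from typing import Iterable, Iterator, List


def extract_chunks(lines: Iterable[str],
                   start_suffix: str = "PUT(",
                   end_suffix: str = "PUT") -> Iterator[List[str]]:
    """Yield the lines strictly between each start/end marker pair.

    Assumption: a start line ends with start_suffix and an end line
    ends with end_suffix (trailing whitespace is ignored).
    """
    chunk = None  # None means "not currently inside a chunk"
    for raw in lines:
        line = raw.rstrip("\n")
        stripped = line.rstrip()
        if chunk is None:
            if stripped.endswith(start_suffix):
                chunk = []      # the marker line itself is not kept
        elif stripped.endswith(end_suffix):
            yield chunk         # the closing marker ends the chunk
            chunk = None
        else:
            chunk.append(line)
```

An open file object can be passed directly, e.g. `for body in extract_chunks(open("first.file")): ...`, so the file is scanned once from top to bottom no matter where the chunks sit.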

Sandman! · ‎06-09-2006

imho the UNIX command ex(1) can be quite useful for extracting chunks of data between two strings or delimited by a well-defined pattern. In your case here's an ex script that does the job based on the sample input you have provided.

ex -s inputfile <<EOF
g/^.*PUT($/+1,/^.*PUT$/-1p
q
EOF

If you provide a representative sample of the input to be processed, the above ex code snippet can be tweaked further to meet the specified criteria.

Peter Nikitka · ‎06-09-2006

Hi,

if the amount of data you need (between the 'PUT' strings) is small compared with the whole file, it is best to put it into a temporary file first. Especially when you need to process this data more than once, that will really speed things up.
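
One way to sketch that idea in Python: scan the big file once, write the wanted chunk to a temporary file, and let every later processing step read the small file instead. The helper name `cache_chunk` and the marker strings are hypothetical; the thread does not say what the real markers are.

```python
import os
import tempfile


def cache_chunk(big_path: str, start: str, end: str) -> str:
    """Copy the lines between the first line containing `start` and the
    next line containing `end` into a temp file; return its path.

    Assumption: the markers are plain substrings on their own lines.
    """
    fd, small_path = tempfile.mkstemp(suffix=".chunk", text=True)
    inside = False
    with open(big_path) as big, os.fdopen(fd, "w") as small:
        for line in big:
            if not inside:
                inside = start in line   # start line itself is skipped
            elif end in line:
                break                    # the rest of the file is not needed
            else:
                small.write(line)
    return small_path
```

The returned path can then be opened as many times as the later steps need, and removed with `os.remove` once the processing is finished.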

About the method for parsing: I would not say that a ksh solution is always much slower, but that holds only when your solution is written purely with builtin functions, e.g.
NOT
fileone=$(cat first.file)
(uses 'cat')
BUT
fileone=$(<first.file)
BTW, awk would be my first choice, nevertheless.

mfG Peter

The Universe is a pretty big place, it's bigger than anything anyone has ever dreamed of before. So if it's just us, seems like an awful waste of space, right? Jodie Foster in "Contact"

Umair_1 · ‎06-10-2006

Thanks. As I said, I debugged and found that the time goes into searching for the chunk with my method, not into the variable-to-file redirection. More specifically, I tried some sample cases: matches relatively near the top and bottom are not the problem, but when it searches for patterns that are deep within the file (80000 lines), it takes about 4-5 minutes per pattern.

When debugged, this line takes 5 minutes when the pattern is in the middle of the file; from the top and bottom it takes seconds.
=======================
dataone=${fileone#*"$i"}
=========================
The Python/Perl equivalent gives the same delay. I will try this ex(1) thing though.
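
For reference, a rough Python rendering of what the two ksh parameter expansions do, assuming `$i` is a literal string (it is quoted in the ksh line, so it is not treated as a pattern). `str.find` locates the first occurrence and slicing takes the remainder. The function names are illustrative, and the sketch makes no claim about how any shell or interpreter performs on the real 80000-line file.

```python
def after_first(text: str, key: str) -> str:
    """Like ${text#*"$key"}: drop everything up to and including the
    first occurrence of key; return text unchanged if key is absent."""
    pos = text.find(key)
    return text if pos < 0 else text[pos + len(key):]


def before_first(text: str, marker: str) -> str:
    """Like ${text%%marker*}: keep everything before the first
    occurrence of marker; return text unchanged if marker is absent."""
    pos = text.find(marker)
    return text if pos < 0 else text[:pos]


def chunk_for(text: str, key: str, end_marker: str = "PUT") -> str:
    """Combine both steps, as the ksh code does for each value of $i."""
    return before_first(after_first(text, key), end_marker)
```

Note that each call still scans the whole string from the beginning, so calling `chunk_for` once per pattern repeats the search over the file contents every time.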

Thanks all

Peter Nikitka · ‎06-10-2006

Hi,

perhaps you could tell us what loop (or loops) you run over your data; your pseudocode does not make clear what "$i" is.

mfG Peter

The Universe is a pretty big place, it's bigger than anything anyone has ever dreamed of before. So if it's just us, seems like an awful waste of space, right? Jodie Foster in "Contact"
