topic Re: PERL for HTML file parsing in Operating System - HP-UX

PERL for HTML file parsing

Dodo_5 — Wed, 28 Feb 2007 00:51:31 GMT

I have an HTML report file. Part of the report is attached as "input html.doc", and its source is attached as "report source code.txt".

I just want to separate the data, so that the first line reads:

NHTEST-3848498958-NHTEST-10.2-no-baloo a
and so on for the whole report.

I have a Perl script, also attached, named "perl coding for parsing.txt". It gives the required output.

Now suppose I have more than one file, i.e. 20 reports in HTML format, and I have to compare values from the tables across the different report files (e.g. compare the buffer cache values from each report).

How can I do that? Please give me some ideas.
I need a script to do this in Unix shell or Perl. Can you help me with this?

Regards,
waiting for your reply.

I have used:

sed -n "s/.*Buffer Cache:<\/TD><[^>]*> *\([0-9,]*[A-Za-z]*\)<\/TD><[^>]*> *\([0-9,]*[A-Za-z]*\).*/\1 \2/p" report.txt

It gives correct values for "Buffer Cache", but because the tags differ it can't give correct values for "Redo Size". I think I can only do this with a script, so please help.

Re: PERL for HTML file parsing

Maxim Yakimenko — Wed, 28 Feb 2007 02:07:09 GMT

Hi,

I don't quite get it: if you have a text file, why struggle with HTML? A text file has no tags or formatting info. Just grep out the needed values ("Redo size") and compare them.

Re: PERL for HTML file parsing

Dodo_5 — Wed, 28 Feb 2007 02:21:14 GMT

I have got the text file for one HTML file. For more than one file I will get more text files.
Then how do I compare the values from the different text files?

Re: PERL for HTML file parsing

Maxim Yakimenko — Wed, 28 Feb 2007 02:41:06 GMT

So, what is the problem? To get several values for comparison you would have to process several HTML files. Instead, process several text files.
One to one is good: just grep out the needed values from all the files and compare them. For example, you can write a script that processes one file. The output of this script is a line that contains the needed values ("Redo size", "Logical reads" and so on) separated by '\t' or a comma or whatever you want. Then run this script against all the text files and collect the output in another file.

For example:

#!/bin/sh
OUTPUT='./output.txt'
cat /dev/null > $OUTPUT
for FILE in `find . -name "*.txt"`;
do
script_process $FILE >> $OUTPUT
done;

In this example, "processing_script" is a Perl script that greps out the needed values.

That's all.

Re: PERL for HTML file parsing

Dodo_5 — Wed, 28 Feb 2007 04:23:22 GMT

I have a little doubt.

What is script_process $FILE >>OUTPUT?
You wrote "processing_script" as the Perl script name.

Also, do I have to put the required item in -name?

Can you add comments to your script so that it will be a little easier for me?

Re: PERL for HTML file parsing

Maxim Yakimenko — Wed, 28 Feb 2007 04:42:19 GMT

Oh :) My mistake.

Yes, I mean:
script_process is the processing script, written in Perl. It takes one argument (the name of the file to process) and greps out the values; its output is redirected to the file $OUTPUT.
You should also give the path to the processing script. The correct version is:

#!/bin/sh
OUTPUT='./output.txt'
cat /dev/null > $OUTPUT
for FILE in `find . -name "*.txt"`;
do
./script_process $FILE >> $OUTPUT
done;

The command find . -name "*.txt" lists the .txt files in the current directory and its subdirectories. You can point it at another directory; it is just an example of how you can tell your script which files to process.
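A slightly more defensive variant of the loop above, as a sketch only. One thing to watch in the original: output.txt is created before find runs and also matches "*.txt", so the output file would be fed to the extractor as well. The function name collect_values and its three arguments are made up for this illustration, and it uses a shell glob instead of find, so it does not descend into subdirectories.

```shell
#!/bin/sh
# Sketch: run a per-file extractor over every report and collect the results.
# collect_values and its arguments are hypothetical names, not from the thread.

collect_values() {
    extractor=$1    # path to the per-file script, e.g. ./script_process
    report_dir=$2   # directory that holds the report text files
    output=$3       # file that receives the extractor output

    : > "$output"   # create or truncate the output file

    # A glob keeps file names with spaces intact (unlike `find` in backticks).
    for file in "$report_dir"/*.txt; do
        [ -f "$file" ] || continue
        # Skip the output file itself if it lives in the report directory.
        [ "$(basename "$file")" = "$(basename "$output")" ] && continue
        "$extractor" "$file" >> "$output"
    done
}
```

It could be called as: collect_values ./script_process . ./output.txt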

Re: PERL for HTML file parsing

Dodo_5 — Wed, 28 Feb 2007 05:00:32 GMT

I understand what you have told me: that script will get the values (buffer cache, redo size etc.) from the text files, but for that I have to run the script_process.pl script.
What I actually need is that Perl script, the one that greps out the values from a text file.

Re: PERL for HTML file parsing

Maxim Yakimenko — Wed, 28 Feb 2007 05:03:39 GMT

Addendum:
If "report source code.txt" is HTML, you must convert it to text. It can be done like this:
for each table in the HTML doc
match the string that contains the entire table
eliminate the <td>, </td> and <tr>, </tr> tags, replacing them with "\t" and "\n" respectively. Of course, cut off the <table></table> tag pair too.
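The steps above can be tried from the shell as well. This is only a rough sketch: table_to_text is a made-up helper name, it assumes no tag is split across lines, and stripping tags with regular expressions only suits simple, regular report tables.

```shell
#!/bin/sh
# Sketch: turn simple HTML table markup into tab-separated text.
# table_to_text is a hypothetical helper name, not something from the thread.

table_to_text() {
    awk '{
        gsub(/<\/[Tt][DdHh]>/, "\t")   # end of a cell -> tab
        gsub(/<\/[Tt][Rr]>/, "\n")     # end of a row  -> newline
        gsub(/<[^>]*>/, "")            # drop every remaining tag
        gsub(/\t\n/, "\n")             # no trailing tab at the end of a row
        printf "%s", $0                # line breaks of the input are not kept
    }' "$1"
}
```

Each table row comes out as one line, with the cells separated by tabs.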

Re: PERL for HTML file parsing

Dodo_5 — Wed, 28 Feb 2007 05:38:32 GMT

Can you modify my Perl script (attached) so that it accepts an HTML file as an argument?
Then it will be easy for me, and I can parse any HTML file just by giving it as an argument.

Re: PERL for HTML file parsing

Maxim Yakimenko — Wed, 28 Feb 2007 06:38:19 GMT

This is a sample script. I am not strong in HTML::TokeParser, so I used regexps to get rid of the HTML.

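#!/usr/local/bin/perl

# open the file given as the first argument
open SRC, "$ARGV[0]" or die "Cannot open $ARGV[0]: $!";
# set line delim to undef
# thus we can treat the file as a string
$/ = undef;
# read data
$data = <SRC>;
# close file
close SRC;

# take the table part of the string
# (the opening-tag patterns were lost when this post was rendered; the lines
# marked "assumed" are a reconstruction, not the original text)
$data =~ /<table.*<\/table>.*/is;    # assumed
$data = $&;

# get rid of html
$data =~ s/<table[^>]*>//ig;         # assumed
$data =~ s/<tr[^>]*>/\n/ig;          # assumed
$data =~ s/<\/table>//ig;
$data =~ s/<\/?p\b[^>]*>//ig;        # assumed
$data =~ s/><\/th>/>Column\t/ig;
$data =~ s/<\/th>/\t/ig;
$data =~ s/<\/TD><\/TR>//ig;
$data =~ s/<\/td>/\t/ig;
$data =~ s/<\/tr>//ig;
$data =~ s/<th[^>]*>//ig;            # assumed
$data =~ s/<td[^>]*>//ig;            # assumed
$data =~ s/\x20{2,}/\t/ig;
$data =~ s/&nbsp;/\t/ig;             # assumed
$data =~ s/\t{2,}/\t/ig;

# for example we want redo size

$data =~ /redo size:\s{1,}([\d\.\,]{1,})\s{1,}([\d\.\,]{1,}).*/is;

# output result
print $1, "\t", "$2";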

open SRC, "$ARGV[0]";
$/= undef;
$data=<SRC>;
close SRC;
# pass a reference so the string is parsed as the document, not used as a file name
$html = HTML::TokeParser->new(\$data);

Re: PERL for HTML file parsing

Maxim Yakimenko — Wed, 28 Feb 2007 06:45:14 GMT

Correction:
The sample script ends with
#output result
print $1, "\t", "$2";

These lines:
open SRC, "$ARGV[0]";
$/= undef;
$data=<SRC>;
close SRC;
$html = HTML::TokeParser->new(\$data);
are an example of how to read a file into a string and create a parser object over that string.

Re: PERL for HTML file parsing

Dodo_5 — Wed, 28 Feb 2007 06:57:11 GMT

I have used:

#!/usr/local/bin/perl
use strict;
use HTML::TokeParser;

then I ran it as:

perl html_parse.pl html

where the script name is "html_parse.pl"
and "html" is the name of my report file.

It still gives a compilation error. Please make the required change in your script to avoid the error.

error:
Global symbol "$html" requires explicit package name at html_parse.pl line 49.
Execution of html_parse.pl aborted due to compilation errors.

Re: PERL for HTML file parsing

Maxim Yakimenko — Wed, 28 Feb 2007 07:04:48 GMT

Look at my previous message.
The working script is:

#!/usr/local/bin/perl

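# open the file given as the first argument
open SRC, "$ARGV[0]";
# set line delim to undef
# thus we can treat the file as a string
$/= undef;
# read data
$data=<SRC>;
# close file
close SRC;

# take the table part of the string
# (the opening-tag patterns were lost when this post was rendered;
# the lines marked "assumed" are a reconstruction)
$data =~ /<table.*<\/table>.*/is;    # assumed
$data = $&;

# get rid of html
$data =~ s/<table[^>]*>//ig;         # assumed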
$data =~ s/

Re: PERL for HTML file parsing

Dodo_5 — Wed, 28 Feb 2007 07:13:17 GMT

OK, your script now gives the correct "redo size" value from the HTML report. But how do I get the buffer cache value or Memory Usage % (i.e. the other values)? I changed the $data line, but it's not working.
You have defined the method for getting "redo size", but it isn't valid for the other parameters (the tags are different in different cases), so those values can't be obtained.
So how do I make a generalised script? I can't run a different script for each parameter. There should be only one script by which the different parameter values can be obtained.

Re: PERL for HTML file parsing

Maxim Yakimenko — Wed, 28 Feb 2007 07:45:54 GMT

You don't have to write a separate script. Just add regexps to match the other params and add print statements to output them. Of course it will be lengthy: you must write one for every needed value.

Find this line in my script:
$data =~ /redo size:\s{1,}([\d\.\,]{1,})\s{1,}([\d\.\,]{1,}).*/is;

The block before it eliminates the HTML, so by the time execution reaches this line the variable $data contains plain text, and you just have to write expressions to match the other values.

The redo size example tells the regexp engine:
find the words "redo size:"
after these words there will be some whitespace
then a sequence of digits, commas and dots
then whitespace again
then a sequence of digits, commas and dots
I have enclosed each sequence of digits, commas and dots in round brackets. This means the matched text goes to the predefined Perl variables $1, $2 and so on: the first sequence to $1 and the second to $2.

For more, look up how to extract matches with Perl. Google it and you'll find a lot about this.

In the script you can use a variable to hold the result:
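The same capture-group idea can be tried from the shell with sed, where \( \) are the round brackets and \1, \2 play the role of $1, $2. This is only a sketch: extract_redo_size is a made-up helper name, and it assumes the input is already plain text with the label and both numbers on one line.

```shell
#!/bin/sh
# Sketch: pull the two numbers that follow "Redo size:" out of a plain-text file.
# extract_redo_size is a hypothetical helper name, not something from the thread.

extract_redo_size() {
    # -n suppresses normal output; the trailing p prints only the lines
    # on which the substitution actually matched.
    sed -n 's/.*[Rr]edo size:[[:space:]]*\([0-9.,][0-9.,]*\)[[:space:]][[:space:]]*\([0-9.,][0-9.,]*\).*/\1 \2/p' "$1"
}
```

The two captured values are printed separated by a single space; lines without the label produce no output.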

#!/usr/local/bin/perl

open SRC, "$ARGV[0]" or die "Cannot open $ARGV[0]: $!";
$/ = undef;
$data = <SRC>;
close SRC;

# take table string
# (the opening-tag patterns were lost when this post was rendered; the lines
# marked "assumed" are a reconstruction, not the original text)
$data =~ /<table.*<\/table>.*/is;    # assumed
$data = $&;

# get rid of html
$data =~ s/<table[^>]*>//ig;         # assumed
$data =~ s/<tr[^>]*>/\n/ig;          # assumed
$data =~ s/<\/table>//ig;
$data =~ s/<\/?p\b[^>]*>//ig;        # assumed
$data =~ s/><\/th>/>Column\t/ig;
$data =~ s/<\/th>/\t/ig;
$data =~ s/<\/TD><\/TR>//ig;
$data =~ s/<\/td>/\t/ig;
$data =~ s/<\/tr>//ig;
$data =~ s/<th[^>]*>//ig;            # assumed
$data =~ s/<td[^>]*>//ig;            # assumed
$data =~ s/\x20{2,}/\t/ig;
$data =~ s/&nbsp;/\t/ig;             # assumed
$data =~ s/\t{2,}/\t/ig;

# for example we want redo size

$result = "";

# match redo size: Per Second, Per Transaction
$data =~ /redo size:\s{1,}([\d\.\,]{1,})\s{1,}([\d\.\,]{1,}).*/is;
$result = $result."\t".$1."\t".$2;
# match Soft Parse %
$data =~ /Soft Parse %:\s{1,}([\d\.\,]{1,}).*/is;
$result = $result."\t".$1;

# and so on

#
#
# add matching for other values here
#
#

# output result, terminated by a newline so each report gets its own line
print $result, "\n";

Re: PERL for HTML file parsing

Dodo_5 — Wed, 28 Feb 2007 08:00:59 GMT

First of all, many thanks for your interest and constant help.

I got your point, but:
1) Using your script is lengthy (that's OK, no problem), but hard-coding the patterns for every value in every table is not good programming practice.

2) Can we print the "redo size" values from 20 HTML files at once?
That is the main requirement; only then can I compare the values in the different reports.
I have to get the "redo size" values from all the HTML reports in the output by running the script.

So please look into the matter.

Re: PERL for HTML file parsing

Maxim Yakimenko — Wed, 28 Feb 2007 08:25:44 GMT

1) Of course it is not good programming practice, but it works. Anyway, you have no choice: the tables are not the same, so you must explicitly specify what to get and how to name it.

2) I'm trying to tell you that I have been looking into the matter from the beginning :)

You have a script that takes specified values out of a specified file, so run this script for every file in the set and save the results somewhere. Thus you will have your redo sizes, extracted from 20 files simultaneously :) (strictly speaking "serially" :) but in one place).
You can write a batch script to do this,
like this:

#!/bin/sh
OUTPUT='./output.txt'
cat /dev/null > $OUTPUT
for FILE in `find . -name "*.txt"`;
do
./script_process $FILE >> $OUTPUT
done;

After it completes, the file output.txt will contain the values from the files.
I wrote it in shell, but this batch script can be written in Perl too.

The idea is: write one script to process a single file, and a second script to call the first one for every file in the set.

Re: PERL for HTML file parsing

Dodo_5 — Thu, 01 Mar 2007 01:37:00 GMT

Can you just tell me how to run the scripts in sequence?
Please give the command-line statements.
I think I first have to write a Perl script named "script_process.pl",
then write the shell script named "file_process.sh".
Then, if the report file is named report.html,
just tell me how to execute them one by one to get the correct answer.

Re: PERL for HTML file parsing

Maxim Yakimenko — Thu, 01 Mar 2007 02:07:15 GMT

I have already told you what to do.
If you did not get it, let's try another way.
Suppose you have 3 reports:
report1.txt
report2.txt
report3.txt
then you run:

cat /dev/null > output.txt
./script_process.pl report1.txt >> output.txt
./script_process.pl report2.txt >> output.txt
./script_process.pl report3.txt >> output.txt

After that, each line in output.txt will contain the values extracted from one reportN.txt.
The second script, which you called file_process.sh, does just that: it finds the files and calls script_process.pl for each report file, one by one. Read more about Perl and shell. Then you open output.txt in Excel or another spreadsheet program that understands tab-delimited files and compare what you wish, build diagrams and so on.
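If a spreadsheet is not at hand, a quick comparison can also be done in the shell. A sketch only: rank_by_column is a made-up helper name, and it assumes output.txt is tab-delimited with the report name in the first column (the scripts in this thread do not print the name, so that part is an assumption).

```shell
#!/bin/sh
# Sketch: order the lines of a tab-delimited file by one numeric column,
# largest value first. Thousands separators such as 1,234.56 are ignored
# for the comparison only; the printed lines are unchanged.
# rank_by_column is a hypothetical helper name, not something from the thread.

rank_by_column() {
    # $1 = file, $2 = column number (1-based)
    awk -F '\t' -v col="$2" '{
        v = $col
        gsub(/,/, "", v)               # 1,234.56 -> 1234.56
        printf "%f\t%s\n", v + 0, $0   # prepend a plain numeric sort key
    }' "$1" |
        sort -t "$(printf '\t')" -k1,1nr |
        cut -f2-                       # drop the sort key again
}
```

For example, rank_by_column output.txt 2 would list the reports from the highest to the lowest value in the second column.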

Re: PERL for HTML file parsing

Maxim Yakimenko — Thu, 01 Mar 2007 02:12:12 GMT

Read the scripts I gave you attentively and try to understand what each script does. Don't just copy them blindly.