Re: PERL for HTML file parsing

Dodo_5 · ‎02-27-2007

I have an HTML report file. Part of the whole report is attached as "input html.doc", and its source is attached as "report source code.txt".

I just want to separate the data, so that the first line would be:

NHTEST-3848498958-NHTEST-10.2-no-baloo a
and so on for the whole report.

I have a Perl script, also attached, named "perl coding for parsing.txt". It can give the required output.

Now suppose I have more than one file, i.e. 20 reports in HTML format, and I have to compare values of all the tables across the different report files (i.e. compare the buffer cache values from different report files).

How can I do that? Please give me some ideas.
I need a script to do this in Unix shell or Perl. Can you help me with this?

Regards.
Waiting for your reply.

I have used:

sed -n "s/.*Buffer Cache:<\/TD><[^>]*> *\([0-9,]*[A-Za-z]*\)<\/TD><[^>]*> *\([0-9,]*[A-Za-z]*\).*/\1 \2/p" report.txt

It gives correct values for "Buffer Cache", but because of tag differences it can't give correct values for "Redo Size". I think I can only do this with the help of a script, so please help.

Maxim Yakimenko · ‎02-27-2007

Hi,

I just don't get it: if you have a text file, why do you struggle with HTML? A text file has no tags or formatting info. Just grep out the needed values ("Redo size") and compare them.

Dodo_5 · ‎02-27-2007

I have the text file for one HTML file. For more than one file I will get more text files.
Then how do I compare the different values from the different text files?

Maxim Yakimenko · ‎02-27-2007

So, what is the problem? To get several values for comparison you would have to process several HTML files. Process several text files instead.
One to one is good: just grep the needed value out of every file and compare them. For example, you can write a script that processes one file. The output of this script is a line that contains the needed values ("Redo size", "Logical reads" and so on) separated by '\t', a comma, or whatever you want. Then run this script against all the text files and collect the output in another file.

I.e.:

#!/bin/sh
OUTPUT='./output.txt'
cat /dev/null > $OUTPUT
for FILE in `find . -name "*.txt"`;
do
script_process $FILE >> $OUTPUT
done;

In this example "processing_script" is a Perl script that greps out the needed values.

That's all.

Dodo_5 · ‎02-27-2007

I have a little doubt...

What is script_process $FILE >> $OUTPUT?
You wrote "processing_script" as the Perl script name.

Also, do I have to put the required item in -name?

Can you add comments to your script, so that it will be a little easier for me?

Maxim Yakimenko · ‎02-27-2007

Oh :) My mistake.

Yes, I mean:
script_process is the processing script written in Perl. It takes one argument, the name of the file to process, greps the values, and its output is redirected to the file $OUTPUT.
You should also give the path to the processing script. The correct version is:

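#!/bin/sh
OUTPUT='./output.txt'
# Start with an empty output file
cat /dev/null > "$OUTPUT"
# Loop over every .txt report, skipping the output file itself
for FILE in `find . -name "*.txt" ! -name "output.txt"`
do
# Append the values extracted from this report
./script_process "$FILE" >> "$OUTPUT"
done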

The command find . -name "*.txt" outputs the list of .txt files in the current directory and its subdirectories. You can point it at another directory; it is just an example of how you can tell your script which files to process.

Dodo_5 · ‎02-27-2007

I understand what you have told me: that script will get the values (like buffer cache, redo size, etc.) from the text file. But for that I have to run the script_process.pl script.
What I actually need is that Perl script, the one that greps the values out of the text file.

Maxim Yakimenko · ‎02-27-2007

Addendum
If "report source code.txt" is HTML, you must convert it to text. It can be done like this:
for each table in the HTML document
match the string that contains the entire table
eliminate the <tr> and <td> tags, and replace the </td> and </tr> tags with "\t" and "\n" respectively. Of course, cut off the <table></table> tag pair.
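
The steps above can be sketched in plain shell. This is only a minimal sketch, not one of the scripts from this thread: the function name html_to_text is made up for illustration, and it assumes that no attribute value contains a literal ">" and that the report is small enough for awk to hold as one line.

```shell
#!/bin/sh
# html_to_text FILE  (hypothetical helper, not one of the scripts in this thread)
# Prints the table cells of FILE as tab-separated text, one table row per line.
html_to_text() {
    # Join the file into one line so that tags split across lines are handled
    tr -d '\r\n' < "$1" |
    awk '{
        gsub(/<\/[Tt][DdHh]>/, "\t")    # closing td or th ends a cell -> tab
        gsub(/<\/[Tt][Rr]>/, "\n")      # closing tr ends a row        -> newline
        gsub(/<[^>]*>/, "")             # drop every remaining tag
        gsub(/&nbsp;/, " ")             # non-breaking space entity    -> space
        gsub(/\t\n/, "\n")              # no trailing tab at the end of a row
        printf "%s", $0
    }'
}
```

It would be called as html_to_text report.html > report.txt, after which each row of a table is one line of label and values.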

Dodo_5 · ‎02-27-2007

Can you modify my Perl script (attached) to accept an HTML file as an argument?
Then it will be easy for me: I can parse any HTML file just by giving it as an argument.

Maxim Yakimenko · ‎02-27-2007

This is a sample script. I am not strong in HTML::TokeParser, so I used regexps to get rid of the HTML.

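#!/usr/local/bin/perl

# Open the file named on the command line
open SRC, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!";
# Set the line delimiter to undef
# so that the whole file is read as one string
$/ = undef;
# Read the data
$data = <SRC>;
# Close the file
close SRC;

# NOTE: the opening-tag patterns of this script are missing from the post.
# Every pattern marked "assumed" below is a guess, not the original text.

# Take the part of the string that holds the tables
$data =~ /<table[^>]*>.*<\/table>.*/is;    # "<table[^>]*>" assumed
$data = $&;

# Get rid of the HTML
$data =~ s/<table[^>]*>//ig;               # assumed
$data =~ s/<p\b[^>]*>/\n/ig;               # assumed
$data =~ s/<\/table>//ig;
$data =~ s/<tr\b[^>]*>//ig;                # assumed
$data =~ s/><\/th>/>Column\t/ig;
$data =~ s/<\/th>/\t/ig;
$data =~ s/<\/TD><\/TR>//ig;
$data =~ s/<\/td>/\t/ig;
$data =~ s/<\/tr>//ig;
$data =~ s/<th\b[^>]*>//ig;                # assumed
$data =~ s/<td\b[^>]*>//ig;                # assumed
$data =~ s/\x20{2,}/\t/ig;
$data =~ s/&nbsp;/\t/ig;                   # "&nbsp;" assumed
$data =~ s/\t{2,}/\t/ig;

# For example, we want redo size
$data =~ /redo size:\s{1,}([\d\.\,]{1,})\s{1,}([\d\.\,]{1,}).*/is;

# Output the result, one line per report
print $1, "\t", $2, "\n";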

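use HTML::TokeParser;
open SRC, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!";
$/ = undef;
$data = <SRC>;
close SRC;
# new() wants a reference to the string; a plain string is taken as a file name
$html = HTML::TokeParser->new(\$data);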

Maxim Yakimenko · ‎02-27-2007

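Correction
The sample script ends with
#output result
print $1, "\t", $2, "\n";

These lines:
open SRC, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!";
$/ = undef;
$data = <SRC>;
close SRC;
$html = HTML::TokeParser->new(\$data);
are an example of how to get a string from a file and create a parser object over this string. They need "use HTML::TokeParser;" at the top of the script, and new() must be given a reference to the string (\$data), because a plain string is taken as a file name.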

Dodo_5 · ‎02-27-2007

I have used:

#!/usr/local/bin/perl
use strict;
use HTML::TokeParser;

Then I run it as:

perl html_parse.pl html

where the script name is "html_parse.pl"
and "html" is the name of my report file.

It still gives a compilation error. Please make the required change in your script to avoid the error.

Error:
Global symbol "$html" requires explicit package name at html_parse.pl line 49.
Execution of html_parse.pl aborted due to compilation errors.

Maxim Yakimenko · ‎02-27-2007

Look at my previous message.
The working script is:

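#!/usr/local/bin/perl

# Open the file named on the command line
open SRC, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!";
# Set the line delimiter to undef
# so that the whole file is read as one string
$/ = undef;
# Read the data
$data = <SRC>;
# Close the file
close SRC;

# NOTE: the opening-tag patterns of this script are missing from the post.
# Every pattern marked "assumed" below is a guess, not the original text.

# Take the part of the string that holds the tables
$data =~ /<table[^>]*>.*<\/table>.*/is;    # "<table[^>]*>" assumed
$data = $&;

# Get rid of the HTML
$data =~ s/<table[^>]*>//ig;               # assumed
$data =~ s/<p\b[^>]*>/\n/ig;               # assumed
$data =~ s/<\/table>//ig;
$data =~ s/<tr\b[^>]*>//ig;                # assumed
$data =~ s/><\/th>/>Column\t/ig;
$data =~ s/<\/th>/\t/ig;
$data =~ s/<\/TD><\/TR>//ig;
$data =~ s/<\/td>/\t/ig;
$data =~ s/<\/tr>//ig;
$data =~ s/<th\b[^>]*>//ig;                # assumed
$data =~ s/<td\b[^>]*>//ig;                # assumed
$data =~ s/\x20{2,}/\t/ig;
$data =~ s/&nbsp;/\t/ig;                   # "&nbsp;" assumed
$data =~ s/\t{2,}/\t/ig;

# For example, we want redo size
$data =~ /redo size:\s{1,}([\d\.\,]{1,})\s{1,}([\d\.\,]{1,}).*/is;

# Output the result, one line per report
print $1, "\t", $2, "\n";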

Dodo_5 · ‎02-27-2007

OK, now your script gives the correct value of "redo size" from the HTML report. But how do I get the buffer cache value, or Memory Usage % (that is, the other values)? I have changed the $data variable, but it is not working.
You have defined the method for getting "redo size", but it is not valid for the other parameters (the tags are different in different cases), so those values cannot be obtained.
So how do I make a generalised script? I can't run a different script for each parameter. There should be only one script by which the different parameter values can be obtained.

Maxim Yakimenko · ‎02-27-2007

You don't have to write a separate script. Just add regexps to match the other parameters and add print statements to output them. Of course it will be lengthy: you must write one for every needed value.

Find this line in my script:
$data =~ /redo size:\s{1,}([\d\.\,]{1,})\s{1,}([\d\.\,]{1,}).*/is;

The block before it eliminates the HTML, so when execution reaches this line the variable $data contains plain text, and you just have to write expressions to match the other values.

The example for redo size says to the regexp engine:
find the words "redo size:"
after these words there will be some whitespace
then a sequence of digits, commas and dots
then whitespace again
then a sequence of digits, commas and dots
I have enclosed each sequence of digits, commas and dots in round brackets. This means that the matched text goes to the predefined Perl variables $1, $2 and so on: the first sequence to $1 and the second to $2.

For this, look up how to extract matches with Perl. Google it and you'll find a lot about this.
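
For comparison, the same pattern can be written for sed, which also appears in this thread. A minimal sketch, assuming the HTML tags are already stripped so that a line reads "Redo size:", whitespace, number, whitespace, number. The function name redo_size is made up for illustration. In sed the round brackets are written \( \), and the captured parts come back as \1 and \2 instead of $1 and $2.

```shell
#!/bin/sh
# redo_size FILE  (hypothetical helper)
# Prints the two numbers that follow "Redo size:" in tag-free text.
redo_size() {
    # [[:space:]]\{1,\}  one or more whitespace characters (\s{1,} in Perl)
    # \([0-9.,]\{1,\}\)  a captured run of digits, dots and commas
    # -n ... p           print only the lines where the substitution matched
    sed -n 's/.*[Rr]edo size:[[:space:]]\{1,\}\([0-9.,]\{1,\}\)[[:space:]]\{1,\}\([0-9.,]\{1,\}\).*/\1 \2/p' "$1"
}
```

Matching another value means changing the label at the front and the number of captured groups.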

In the script you can use a variable to hold the result:

#!/usr/local/bin/perl

open SRC, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!";
$/ = undef;
$data = <SRC>;
close SRC;

# NOTE: the opening-tag patterns of this script are missing from the post.
# Every pattern marked "assumed" below is a guess, not the original text.

# Take the table string
$data =~ /<table[^>]*>.*<\/table>.*/is;    # "<table[^>]*>" assumed
$data = $&;

# Get rid of the HTML
$data =~ s/<table[^>]*>//ig;               # assumed
$data =~ s/<p\b[^>]*>/\n/ig;               # assumed
$data =~ s/<\/table>//ig;
$data =~ s/<tr\b[^>]*>//ig;                # assumed
$data =~ s/><\/th>/>Column\t/ig;
$data =~ s/<\/th>/\t/ig;
$data =~ s/<\/TD><\/TR>//ig;
$data =~ s/<\/td>/\t/ig;
$data =~ s/<\/tr>//ig;
$data =~ s/<th\b[^>]*>//ig;                # assumed
$data =~ s/<td\b[^>]*>//ig;                # assumed
$data =~ s/\x20{2,}/\t/ig;
$data =~ s/&nbsp;/\t/ig;                   # "&nbsp;" assumed
$data =~ s/\t{2,}/\t/ig;

# Collect all the values in one variable
$result = "";

# Match redo size: Per Second, Per Transaction
$data =~ /redo size:\s{1,}([\d\.\,]{1,})\s{1,}([\d\.\,]{1,}).*/is;
$result = $result."\t".$1."\t".$2;
# Match Soft Parse %
$data =~ /Soft Parse %:\s{1,}([\d\.\,]{1,}).*/is;
$result = $result."\t".$1;

# and so on

#
#
# add matching for other values here
#
#

# Output the result, one line per report
print $result, "\n";

Dodo_5 · ‎02-28-2007

First of all, many, many thanks for your interest and constant help.

I got your point, but:
1) First, using your script is lengthy (that's OK, no problem), but setting up all the parameters for all the different values in the tables is also not good programming practice.

2) Can we print the values of "redo size" from 20 HTML files simultaneously?
That is the main requirement; only then can I compare the values in the different reports.
I have to get the "redo size" values from all the HTML reports in the output by running the script.

So please look into the matter.

Maxim Yakimenko · ‎02-28-2007

1) Of course it is not good programming practice, but it works. Anyway, you have no choice: the tables are not the same, so you must explicitly specify what to get and how to name it.

2) I have been trying to tell you that from the beginning. I am looking into the matter :)

You have a script that takes specified values out of a specified file. So run this script for every file in the set and save the results somewhere. Thus you will have your redo size extracted from 20 files simultaneously :) (strictly speaking, serially :) but in one place).
You can write a batch script to do this,
like this:

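#!/bin/sh
OUTPUT='./output.txt'
# Start with an empty output file
cat /dev/null > "$OUTPUT"
# Loop over every .txt report, skipping the output file itself
for FILE in `find . -name "*.txt" ! -name "output.txt"`
do
# Append the values extracted from this report
./script_process "$FILE" >> "$OUTPUT"
done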

After it completes, the file output.txt will contain the values from the files.
I wrote it in shell, but this batch script can be written in Perl too.

The idea is this: write one script to process a single file, and write a second script to call the first one for every file in the set.

Dodo_5 · ‎02-28-2007

Can you just tell me how to run the scripts in sequence?
Tell me the command-line statements.
I think I first have to write a Perl script named "script_process.pl",
then write the shell script named "file_process.sh".
Then, if the name of the report file is report.html,
just tell me how to execute them one by one to get the correct answer.

Maxim Yakimenko · ‎02-28-2007

I have already told you what to do.
If you did not get it, let's try another way.
Suppose you have 3 reports:
report1.txt
report2.txt
report3.txt
Then you run:

cat /dev/null > output.txt
./script_process.pl report1.txt >> output.txt
./script_process.pl report2.txt >> output.txt
./script_process.pl report3.txt >> output.txt

After that, each line in output.txt will contain the values extracted from one reportN.txt.
The second script, which you called file_process.sh, does just that: it finds the files and calls script_process.pl for each report file, one by one. Read more about Perl and shell. Then you can open output.txt in Excel or another spreadsheet program that understands tab-delimited files and compare what you wish, build diagrams and so on.
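
To tell afterwards which line of output.txt came from which report, the file name can be written in front of each line. A minimal sketch under assumptions: collect is a made-up helper, and its first argument is whatever per-file command you use (for example ./script_process.pl), which must print the values for one report on one line of standard output.

```shell
#!/bin/sh
# collect EXTRACTOR OUTPUT FILE...  (hypothetical helper)
# Runs EXTRACTOR once per FILE and writes "file<TAB>values" lines to OUTPUT.
collect() {
    extractor=$1
    output=$2
    shift 2
    : > "$output"                         # create or empty the output file
    for file in "$@"; do
        values=$("$extractor" "$file")    # what the per-file script printed
        printf '%s\t%s\n' "$file" "$values" >> "$output"
    done
}
```

It would be called as collect ./script_process.pl output.txt report1.txt report2.txt report3.txt, which gives a tab-delimited file with the report name in the first column.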

Maxim Yakimenko · ‎02-28-2007

Read the scripts I have given you attentively and try to understand what each script does. Don't just copy them blindly.

Dodo_5 · ‎02-28-2007

Hey, everything is OK now.
But I have to set up patterns for more than 300-400 parameters to get their values.
That's really not feasible. Your script is good for getting two or three required values.
But anyway, thank you very much for sharing your knowledge and helping a lot. Actually I was confused because your script names were not given.
But if it's possible to write a generalised script for getting the values, then please help me.

sed -n "s/.*Buffer Cache:<\/TD><[^>]*> *\([0-9,]*[A-Za-z]*\)<\/TD><[^>]*> *\([0-9,]*[A-Za-z]*\).*/\1 \2/p" report.txt

This command (which I am using now) can also give the correct values for one variable, but it does not work for the others. So there, too, I have to change the tag patterns every time. That is why I need a generalised script.
Thanks.

Maxim Yakimenko · ‎02-28-2007

If so, use the Perl script to get rid of the HTML tags. After that, tag differences will no longer be a problem, and you can use sed to retrieve what you need.
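
If the tag-free text has one table row per line with tab-separated cells (an assumption about what the tag-stripping step produces), the label itself can become a parameter, so no pattern has to be rewritten for each value. A minimal sketch: get_value is a made-up name, and awk stands in for sed here because it splits the columns directly.

```shell
#!/bin/sh
# get_value LABEL FILE...  (hypothetical helper)
# For every row whose first cell equals LABEL, print the file name
# followed by the remaining cells, all tab-separated.
get_value() {
    label=$1
    shift
    awk -F '\t' -v label="$label" '
        # Remove leading and trailing blanks from one cell
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        trim($1) == label {
            line = FILENAME
            for (i = 2; i <= NF; i++) line = line "\t" trim($i)
            print line
        }' "$@"
}
```

For example, get_value "Redo size:" report1.txt report2.txt prints one line per report, so the same command works for any label that appears in the first column.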

Maxim Yakimenko · ‎02-28-2007

Also, I wonder: how do you want to create a generalised script for a report that has different tables?

Dodo_5 · ‎02-28-2007

Have you run my Perl script, which I attached? It is also a generalised script, which gives a decent output of all the variables from the HTML file.

Dodo_5 · ‎02-28-2007

If the Perl script name is "parse_html.pl"
and the source text file of the HTML file is "code.txt",
then run it as:

perl parse_html.pl code.txt

Then it will show you the parsed result.
But for multiple files I got stuck.
