Operating System - HP-UX

PERL for HTML file parsing

 
Dodo_5
Frequent Advisor

PERL for HTML file parsing

I have an HTML report file. A part of the whole report is attached as "input html.doc", and its source is attached as "report source code.txt".

I just want to separate the data so that, for example, the first line becomes:

NHTEST-3848498958-NHTEST-10.2-no-baloo a

and so on for the whole report.

I have a Perl script, also attached, named "perl coding for parsing.txt". It gives the required output.

Now suppose I have more than one file, say 20 reports in HTML format, and I have to compare the values of the tables across the different report files (for example, compare the Buffer Cache values from each report file).

How can I do that? Please give me some ideas.
I need a script to do this in Unix shell or Perl. Can you help me with this?

Regards.
Waiting for your reply.

I have used:

sed -n "s/.*Buffer Cache:<\/TD><[^>]*> *\([0-9,]*[A-Za-z]*\)<\/TD><[^>]*> *\([0-9,]*[A-Za-z]*\).*/\1 \2/p" report.txt

It gives correct values for "Buffer Cache", but because the tags are different it cannot give correct values for "Redo Size". I think I can only do this with the help of a script, so please help.
40 REPLIES
Maxim Yakimenko
Super Advisor

Re: PERL for HTML file parsing

Hi,

I don't quite get it: if you have a text file, why do you struggle with HTML? A text file has no tags or formatting info, so just grep out the needed values ("Redo size") and compare them.
Dodo_5
Frequent Advisor

Re: PERL for HTML file parsing

I have the text file for one HTML file. For more than one file I will get more text files.
Then how do I compare the different values across the different text files?
Maxim Yakimenko
Super Advisor

Re: PERL for HTML file parsing

So, what is the problem? To get several values for comparison you must process several HTML files. Instead, process several text files.
One-to-one is good: just grep out the needed value from each file and compare them. For example, you can write a script that processes one file. The output of this script is a line that contains the needed values ("Redo size", "Logical reads" and so on) separated by '\t', a comma, or whatever you want. Then run this script against all the text files and collect the output in another file.

For example:

#!/bin/sh
OUTPUT='./output.txt'
cat /dev/null > $OUTPUT
for FILE in `find . -name "*.txt"`;
do
script_process $FILE >> $OUTPUT
done;

In this example, "processing_script" is a Perl script that greps out the needed values.

That's all.
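
A minimal sketch of what such a per-file processing step might look like, written as a shell function around awk rather than in Perl. The function name extract_metric is hypothetical, and the sketch assumes the text report has lines of the form "Label:  value1  value2" (for example a "Redo size:" line followed by two numbers), which may not match your reports exactly.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): extract_metric LABEL FILE
# Prints "FILE<TAB>value1<TAB>value2" for the first line of FILE that
# contains "LABEL:" (case-insensitive) followed by at least two values.
# Assumes the values are separated from each other by blanks or tabs.
extract_metric() {
    awk -v want="$1" -v name="$2" '
        BEGIN { want = tolower(want) ":" }
        {
            pos = index(tolower($0), want)
            if (pos == 0) next
            # Whatever follows the label holds the values.
            rest = substr($0, pos + length(want))
            n = split(rest, val, " ")
            if (n >= 2) {
                printf "%s\t%s\t%s\n", name, val[1], val[2]
                exit
            }
        }' "$2"
}

# Usage: extract_metric "redo size" report1.txt >> output.txt
```

Putting the file name in the first column keeps each collected line traceable to its report once the results from all files end up in one output file.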


Dodo_5
Frequent Advisor

Re: PERL for HTML file parsing

I have a little doubt.

What is script_process $FILE >> $OUTPUT, given that you wrote "processing_script" as the Perl script name?

Also, do I have to write the required item in -name?

Can you add comments to your script so that it will be a little easier for me?
Maxim Yakimenko
Super Advisor

Re: PERL for HTML file parsing

Oh :) My mistake.

Yes, I mean that script_process is a processing script written in Perl. It takes one argument, the name of the file to process, greps out the values, and its output is redirected to the file $OUTPUT.
You should also give the path to the processing script. The correct version is:

#!/bin/sh
OUTPUT='./output.txt'
cat /dev/null > $OUTPUT
for FILE in `find . -name "*.txt"`;
do
./script_process $FILE >> $OUTPUT
done;


The command find . -name "*.txt" outputs the list of .txt files under the current directory (including subdirectories). You can point it at another directory; it is just an example of how you can tell your script which files to process.
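
Once the values from all reports are collected in one file, the comparison itself can be a short pipeline. A minimal sketch, assuming each collected line has the form name<TAB>value1<TAB>value2 and that value1 is a plain number with optional thousands separators. The line layout and the function name rank_reports are assumptions for illustration; values with unit suffixes such as 512M would need converting first.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): rank_reports COLLECTED_FILE
# Lists the reports from lowest to highest first value, one
# "name<TAB>value" pair per line, with thousands separators removed.
rank_reports() {
    # Put the cleaned-up value in front so that sort can order on it,
    # then swap the columns back for display.
    awk -F '\t' '{ v = $2; gsub(/,/, "", v); printf "%s\t%s\n", v, $1 }' "$1" |
        LC_ALL=C sort -n |
        awk -F '\t' '{ printf "%s\t%s\n", $2, $1 }'
}

# Usage: rank_reports output.txt
```

LC_ALL=C is set for sort so that the decimal point is read the same way regardless of the locale of the machine.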
Dodo_5
Frequent Advisor

Re: PERL for HTML file parsing

I understand what you have told me: that script will get the values (like Buffer Cache, Redo Size, etc.) from the text file, but for that I have to run the script_process.pl script.
What I actually need is that Perl script itself, the one that greps out the values from the text file.
Maxim Yakimenko
Super Advisor

Re: PERL for HTML file parsing

Addendum:
If "report source code.txt" is HTML, you must convert it to text. It can be done like this:
for each table in the HTML doc
match the string that contains the entire table
eliminate the cell tags (<td>, </td>) and the row tags (<tr>, </tr>), replacing them with "\t" and "\n" respectively. Of course, cut off the <table> tag pair as well.
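
The steps above can be sketched as a small shell function. This is only an illustration under assumptions: the name html_table_to_text is hypothetical, the report is assumed to use plain table, tr, td and th tags with no nested tables, and the local awk is assumed to cope with the whole file joined into one long line (older awk implementations may have a line length limit).

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): html_table_to_text FILE
# Turns table rows into lines and table cells into tab-separated fields.
html_table_to_text() {
    # Join all lines so that rows broken across lines stay together.
    tr '\n' ' ' < "$1" | awk '
        {
            s = $0
            gsub(/<\/[Tt][DdHh] *>/, "\t", s)   # end of a cell -> tab
            gsub(/<\/[Tt][Rr] *>/, "\n", s)     # end of a row  -> newline
            gsub(/<[^>]*>/, "", s)              # drop all remaining tags
            gsub(/&nbsp;/, " ", s)              # non-breaking spaces
            n = split(s, row, "\n")
            for (i = 1; i <= n; i++) {
                gsub(/ *\t */, "\t", row[i])         # trim blanks around tabs
                gsub(/^[ \t]+|[ \t]+$/, "", row[i])  # trim both ends
                if (row[i] != "") print row[i]
            }
        }'
}

# Usage: html_table_to_text report.html > report.txt
```

Replacing only the closing cell and row tags, and then deleting every other tag, avoids having to know which attributes the opening tags carry, which is where a fixed sed pattern tends to break.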
Dodo_5
Frequent Advisor

Re: PERL for HTML file parsing

Can you modify my Perl script (attached) to accept an HTML file as an argument?
Then it will be easy for me, and I can parse any HTML file just by giving it as an argument.
Maxim Yakimenko
Super Advisor

Re: PERL for HTML file parsing

This is a sample script. I am not strong in HTML::TokeParser, so I used regexps to get rid of the HTML.



#!/usr/local/bin/perl

#let's open the file given as the first argument
open SRC, "<", $ARGV[0] or die "cannot open $ARGV[0]: $!";
#set line delim to undef
#thus we can treat file as a string
$/ = undef;
#read data
$data = <SRC>;
#close file
close SRC;


#take table part string (from the first <table tag to the last </table>)
$data =~ /<table.*<\/table>/is;
$data = $&;


#get rid of html
$data =~ s///ig;
$data =~ s/

/\n/ig;
$data =~ s/<\/table>//ig;
$data =~ s///ig;
$data =~ s/><\/th>/>Column\t/ig;
$data =~ s/<\/th>/\t/ig;
$data =~ s/<\/TD><\/TR>//ig;
$data =~ s/<\/td>/\t/ig;
$data =~ s/<\/tr>//ig;
$data =~ s///ig;
$data =~ s///ig;
$data =~ s/\x20{2,}/\t/ig;
$data =~ s/ /\t/ig;
$data =~ s/\t{2,}/\t/ig;

#for example we want redo size

$data =~ /redo size:\s{1,}([\d\.\,]{1,})\s{1,}([\d\.\,]{1,}).*/is;

#output result
print $1, "\t", "$2";

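#the same read, as a starting point for HTML::TokeParser
use HTML::TokeParser;
open SRC, "<", $ARGV[0] or die "cannot open $ARGV[0]: $!";
$/ = undef;
$data = <SRC>;
close SRC;
#new() takes a file name, a file handle, or a reference to a string
$html = HTML::TokeParser->new(\$data);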