
Re: Web parsing using perl

 
robert_177
Occasional Contributor

Web parsing using perl

Hi list,
I am using Perl to parse web logs and got the output shown below. I want to remove all the characters after /documentum/foldercontents.jhtml.

I am using the following regex to parse the requested page:
while () {
m{^
\"(?:(-)|http\:\/\/(.*?))"\s+
.*?
$}x

After the first round of parsing, I am trying to run the following to remove everything after .jhtml:
$request =~ s/jhtml[a-z0-9A-Z_:@&=+,.!~*'%$]*/jhtml/;

10.15.18.67 | 07/Jul/2003:11:38:19 | /documentum/foldercontents.jhtml?action=folders&path=%2FUsers%2FMember+Policies&objectId=&parentFolder=TOP&objectName=Member+Policies&writeable=false | www.portallink.com/sidebar.jhtml | user1 | XYZ Company, Inc.
10.15.18.67 | 07/Jul/2003:11:38:45 | /documentum/foldercontents.jhtml?path=%2FUsers%2FMember+Policies%2FC&objectId=0b0002768001acc9&action=folders&objectName=C&writeable=false | www.portallink.com/documentum/foldercontents.jhtml?action=folders&path=%2FUsers%2FMember+Policies&objectId=&parentFolder=TOP&objectName=Member+Policies&writeable=false | user1 | XYZ Company, Inc.
10.15.18.67 | 07/Jul/2003:11:39:01 | /documentum/foldercontents.jhtml?path=%2FUsers%2FMember+Policies%2FC%2FCenterPoint+Energy%2C+Inc.&objectId=0b00027680032093&action=folders&objectName=CenterPoint+Energy%2C+Inc.&writeable=false | www.portallink.com/documentum/foldercontents.jhtml?path=%2FUsers%2FMember+Policies%2FC&objectId=0b0002768001acc9&action=folders&objectName=C&writeable=false | user1 | XYZ Company, Inc.
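
One possible reason the substitution leaves these lines untouched is that its character class has no '?', so the match ends right after "jhtml". The sketch below is only an illustration under assumptions: the sample request is shortened from the first log line, and the helper names original_attempt and strip_query are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Roughly the original attempt. The character class contains no '?',
# so matching stops right after "jhtml" and the text is left unchanged.
# (The '$' is left out of the class and '@' is escaped here to avoid
# any variable interpolation inside the pattern.)
sub original_attempt {
    my ($request) = @_;
    $request =~ s/jhtml[a-zA-Z0-9_:\@&=+,.!~*'%]*/jhtml/;
    return $request;
}

# Hypothetical alternative: match the literal '?' and everything up to
# the next whitespace or '|', and keep ".jhtml".
sub strip_query {
    my ($request) = @_;
    $request =~ s/(\.jhtml)\?[^\s|]*/$1/;
    return $request;
}

# Shortened sample, modeled on the first log line above.
my $request = '/documentum/foldercontents.jhtml?action=folders&writeable=false';
print original_attempt($request), "\n";   # query string still present
print strip_query($request), "\n";        # query string removed
```

Stopping at whitespace or '|' means the same pattern could also be applied to a whole log line with the /g flag without eating the fields that follow.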

Please help.
what is passat
Steven E. Protter
Exalted Contributor

Re: Web parsing using perl

I almost never answer these questions because there are people here who are much better at this than I am.

In the Perl script, prior to processing:

$unparse="/documentum/foldercontents.jhtml";

system('sed s/i$unparse//g');

This sed statement, if correct (it may need a correction or tweak), will replace the document... string with an empty string.
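
The same replacement can also be done inside Perl, which avoids the shell and its quoting rules (inside single quotes Perl does not interpolate $unparse, and sed would still need its input supplied). This is a minimal sketch, not a tested drop-in: the helper name remove_fixed_string and the sample line are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $unparse = '/documentum/foldercontents.jhtml';

# remove_fixed_string is a hypothetical helper name.
sub remove_fixed_string {
    my ($line, $fixed) = @_;
    # \Q...\E quotes regex metacharacters such as '.' in $fixed,
    # so the string is matched literally.
    $line =~ s/\Q$fixed\E//g;
    return $line;
}

# Made-up sample line in the same shape as the log output.
my $sample = "10.15.18.67 | $unparse?action=folders | user1";
print remove_fixed_string($sample, $unparse), "\n";
```

Note that this removes the path itself and leaves the query string behind, so it would still need adjusting if the goal is to keep the path and drop what follows it.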

Now you are ready to parse.

You might also want to look at Webalizer. There is a port for HP-UX at the link below.

It might save you some time:
http://hpux.connect.org.uk/hppd/hpux/Networking/WWW/webalizer-2.01.05/

HP moved their public domain software link without notice.

ARRRGGGHHH!

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Patrice Bedard
New Member
Solution

Re: Web parsing using perl

Hi,
From what I read, you had this output at the end of your first parsing pass, and from there you wanted to remove everything after the .jhtml. Here is some simple code I wrote to do it. I saved your output in a file (outpout.log):

#!/usr/bin/perl

open (LOG ,"outpout.log");

while (<LOG>)
{
$_ =~ s/\.jhtml(.*)$/jhtml/ ;
print $_ . "\n";

}

and I got the following output:

10.15.18.67 | 07/Jul/2003:11:38:19 | /documentum/foldercontentsjhtml

10.15.18.67 | 07/Jul/2003:11:38:45 | /documentum/foldercontentsjhtml

10.15.18.67 | 07/Jul/2003:11:39:01 | /documentum/foldercontentsjhtml
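
In this output the dot before jhtml is gone and the referrer, user, and company fields are cut off, because the pattern consumes the dot and (.*)$ runs to the end of the line. If those should be kept, one possible variation is sketched below. It assumes the fields are separated by " | " as in the sample, and strip_queries is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_queries is a hypothetical helper name.
sub strip_queries {
    my ($line) = @_;
    chomp $line;
    # Fields are assumed to be separated by " | ", as in the sample output.
    my @fields = split /\s+\|\s+/, $line;
    # In each field, keep ".jhtml" and drop "?" plus everything after it.
    s/(\.jhtml)\?.*$/$1/ for @fields;
    return join ' | ', @fields;
}

# Same file name as in the code above.
my $file = 'outpout.log';
if (open my $log, '<', $file) {
    while (my $line = <$log>) {
        print strip_queries($line), "\n";
    }
    close $log;
} else {
    warn "Cannot open $file: $!\n";
}
```

Because each line is chomped before printing, this version also avoids the blank line between records.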

I hope it will help you.

Have a nice day.