Web parsing using perl
Operating System - HP-UX
07-08-2003 11:58 AM
Hi list,
I am using Perl to parse web logs and got the output below. I want to remove all the characters after /documentum/foldercontents.jhtml.
I am using the following regex to parse the requested page:
while (<LOG>) {    # the filehandle name was lost from the post; LOG is assumed
    m{^
        \"(?:(-)|http\:\/\/(.*?))"\s+
        .*?
    $}x;
}
After the first round of parsing, I run the following to remove everything after .jhtml:
$request =~ s/jhtml[a-z0-9A-Z_:@&=+,.!~*'%$]*/jhtml/;
10.15.18.67 | 07/Jul/2003:11:38:19 | /documentum/foldercontents.jhtml?action=folders&path=%2FUsers%2FMember+Policies&objectId=&parentFolder=TOP&objectName=Member+Policies&writeable=false | www.portallink.com/sidebar.jhtml | user1 | XYZ Company, Inc.
10.15.18.67 | 07/Jul/2003:11:38:45 | /documentum/foldercontents.jhtml?path=%2FUsers%2FMember+Policies%2FC&objectId=0b0002768001acc9&action=folders&objectName=C&writeable=false | www.portallink.com/documentum/foldercontents.jhtml?action=folders&path=%2FUsers%2FMember+Policies&objectId=&parentFolder=TOP&objectName=Member+Policies&writeable=false | user1 | XYZ Company, Inc.
10.15.18.67 | 07/Jul/2003:11:39:01 | /documentum/foldercontents.jhtml?path=%2FUsers%2FMember+Policies%2FC%2FCenterPoint+Energy%2C+Inc.&objectId=0b00027680032093&action=folders&objectName=CenterPoint+Energy%2C+Inc.&writeable=false | www.portallink.com/documentum/foldercontents.jhtml?path=%2FUsers%2FMember+Policies%2FC&objectId=0b0002768001acc9&action=folders&objectName=C&writeable=false | user1 | XYZ Company, Inc.
Please help.
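A likely reason the substitution above leaves the lines untouched: every query string in the sample starts with "?", and "?" is not in the character class, so the class matches zero characters and "jhtml" is simply replaced by "jhtml". The sketch below illustrates this on a shortened request field from the sample. It is not the poster's exact code: the "$" and "@" in the class are escaped here so that Perl treats them as plain characters, and the sub names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A shortened request field taken from the sample output in the post.
my $request = '/documentum/foldercontents.jhtml?action=folders&path=%2FUsers';

# Same idea as the class in the post ("$" and "@" escaped); "?" is not in it.
sub strip_without_qmark {
    my ($s) = @_;
    $s =~ s/jhtml[a-z0-9A-Z_:\@&=+,.!~*'%\$]*/jhtml/;
    return $s;
}

# Identical, except that "?" has been added to the class.
sub strip_with_qmark {
    my ($s) = @_;
    $s =~ s/jhtml[a-z0-9A-Z_:\@&=+,.!~*'%\$?]*/jhtml/;
    return $s;
}

print strip_without_qmark($request), "\n";   # unchanged: the class cannot match "?"
print strip_with_qmark($request), "\n";      # /documentum/foldercontents.jhtml
```

Adding "?" is enough for the three sample lines, because the slashes in their query strings are all encoded as %2F.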
what is passat
07-08-2003 12:21 PM
Re: Web parsing using perl
I almost never answer these questions because there are people much better at this than me.
In the perl script, prior to processing
$unparse="/documentum/foldercontents.jhtml";
system('sed s/i$unparse//g');
This sed statement, if correct (it may need a correction or a tweak), will replace the /documentum/... string with an empty string.
Now you are ready to parse.
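As posted, the sed call would need some of the tweaks mentioned above: the single quotes stop Perl from interpolating $unparse, the slashes in the path collide with the "/" delimiter of the sed expression, and no input file is given. A minimal sketch of the same removal done inside Perl, without shelling out, might look like the following. The sub name remove_path and the shortened sample line are assumptions for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $unparse = "/documentum/foldercontents.jhtml";

# Remove every literal occurrence of $unparse from one line.
# \Q...\E quotes the dots and slashes so they are matched literally.
sub remove_path {
    my ($line) = @_;
    $line =~ s/\Q$unparse\E//g;
    return $line;
}

# Shortened, hypothetical log line in the style of the sample output.
print remove_path("10.15.18.67 | /documentum/foldercontents.jhtml?action=folders | user1"), "\n";
# prints: 10.15.18.67 | ?action=folders | user1
```

Note that this removes the path itself and leaves the query string in place, which is what the sed statement describes.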
You might also want to look at Webalizer. There is a port for HP-UX at this link.
It might save you some time:
http://hpux.connect.org.uk/hppd/hpux/Networking/WWW/webalizer-2.01.05/
HP moved their public domain software link without notice.
ARRRGGGHHH!
SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
07-08-2003 12:36 PM
Solution
Hi,
From what I read, you had the output at the end of your first parsing pass, and from there you wanted to remove everything after the .jhtml. Here is a simple script I made to do it. I saved your output in a file (outpout.log):
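#!/usr/bin/perl
# Read the saved output and cut each line off at ".jhtml".
open (LOG, "outpout.log") or die "Cannot open outpout.log: $!";
while (<LOG>)
{
    # Replace ".jhtml" and the rest of the line with "jhtml".
    # The replacement has no dot, so lines end in "foldercontentsjhtml".
    $_ =~ s/\.jhtml(.*)$/jhtml/ ;
    print $_ . "\n";
}
close (LOG);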
and I got the following output:
10.15.18.67 | 07/Jul/2003:11:38:19 | /documentum/foldercontentsjhtml
10.15.18.67 | 07/Jul/2003:11:38:45 | /documentum/foldercontentsjhtml
10.15.18.67 | 07/Jul/2003:11:39:01 | /documentum/foldercontentsjhtml
I hope it will help you.
Have a nice day.
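The script above cuts each line at the first ".jhtml", so the dot and the referrer, user and company fields are lost as well. If those should be kept, one possible variant is to remove only the query strings. The sketch below assumes that a query string starts with "?" and runs up to the next whitespace or "|", as in the sample output. The sub name strip_query is hypothetical and the sample line is shortened from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove the query string after ".jhtml" in every field of one log line.
# Assumes a query starts with "?" and ends at the next whitespace or "|".
sub strip_query {
    my ($line) = @_;
    $line =~ s/(\.jhtml)\?[^\s|]*/$1/g;
    return $line;
}

# Shortened version of the second sample line from the post.
my $sample = '10.15.18.67 | 07/Jul/2003:11:38:45 | '
           . '/documentum/foldercontents.jhtml?path=%2FUsers&writeable=false | '
           . 'www.portallink.com/documentum/foldercontents.jhtml?action=folders | '
           . 'user1 | XYZ Company, Inc.';

print strip_query($sample), "\n";
# prints: 10.15.18.67 | 07/Jul/2003:11:38:45 | /documentum/foldercontents.jhtml | www.portallink.com/documentum/foldercontents.jhtml | user1 | XYZ Company, Inc.
```

Because of the /g flag the referrer field is trimmed too; dropping the flag would trim only the first match, which is the requested page in this layout.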