09-06-2013 09:11 AM
Extract Data from Flat Text
It's been a while since I've been on the ITRC. I am posing a question to the script gurus.
I have been given some data, originally screen scrapes from a client's web location.
There is ONE page per record; each record is formatted as such:
I would like to pull out the "FIELDS" into a CSV, so I can populate a DB table with them.
Using awk or Perl, I am looking for suggestions on how to process each line as a "field" and each document as a record.
Suggestions on extraction and conversion deeply appreciated.
09-07-2013 12:48 AM - edited 09-07-2013 12:49 AM
Re: Extract data from flat text
>how to process each line as a "field" and each document as a record.
The first line seems to have 3 fields, is that a special case?
Also, are those underscores actually there or are they spaces?
Otherwise, each field seems to be separated by "|". And I suppose the continuation lines have leading spaces.
And blank lines and lines with underscores should be ignored.
Do you have some examples of data, instead of a picture?
(A picture may be worth a 1000 words, but in this case it's worthless for scripting since I can't grep it, unless OCR works. ;-)
09-10-2013 10:34 AM
Re: Extract Data from Flat Text
There are a *lot* of questions to answer about the data.
Is the data actually plain ASCII or is this an HTML snapshot of the web page?
You say 'record'...is this one text file per screen scrape or are the scrapes hooked together in a long file?
If part of a large file, how do you determine where each screen starts and stops?
Are there tabs occupying the white space or just spaces?
Does the text have special characters embedded that are not visible?
Bill Hassell, sysadmin
09-11-2013 06:31 AM - edited 09-11-2013 07:08 AM
Re: Extract Data from Flat Text
Bill and Dennis,
I actually ran HTML through "html2text", then wrote a scriptlet to strip out some other stuff to reduce it to the core.
The original piece was actually HTML.
What is seen is ASCII text.
Dennis, the '_' is actually an underscore. I have tried to field strip with awk -F'_' but it doesn't handle it correctly.
I think this is how html2text renders spaces.
I put it in a hex editor to see what character it was: it is actually \x5F (underscore).
In the HTML source it is in fact \x20, i.e. spaces.
(see attached html)
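For anyone without a hex editor to hand, the same check can be done from the shell. This is only a sketch: the sample line is made up (the real html2text output was never posted), and `count_char` is a hypothetical helper name.

```shell
# Hypothetical sample line; substitute a line from the real html2text output.
sample='Name_|_John_Smith'

# Dump every byte in hex: 5f is an underscore, 20 a space, 09 a tab.
printf '%s\n' "$sample" | od -An -tx1

# count_char is a hypothetical helper: it counts how often one
# character occurs on standard input.
count_char() {
    tr -cd "$1" | wc -c | tr -d ' '
}

underscores=$(printf '%s\n' "$sample" | count_char '_')
tabs=$(printf '%s\n' "$sample" | count_char '\t')
echo "underscores=$underscores tabs=$tabs"
```

Running the same two counts against a whole file (`count_char '_' < file`) also answers the earlier question about tabs versus spaces.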
09-11-2013 12:09 PM
Re: Extract data from flat text
>I actually ran HTML through "html2text"
So that's what you want to process?
>I have tried to field strip with awk -F'_' but it doesn't handle it correctly.
Your field separator appears to be "|", so use -F"|". Then use gsub to convert "_" back to space.
> (see attached html)
(That's missing. The attachment must have a suffix like .txt.)
Can you also provide some example html2text output so we can see what needs to be processed?
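The -F"|" plus gsub suggestion could look roughly like the sketch below. Since no sample data was posted, the input layout is an assumption: fields separated by "|", underscores standing in for spaces, and blank or underscore-only lines to be skipped. `split_fields` is a hypothetical name, and the output is a plain comma join with no CSV quoting.

```shell
# split_fields (hypothetical name): read html2text-style lines on stdin,
# split on "|", turn underscores back into spaces, trim the padding,
# and print the fields joined by commas.
split_fields() {
    awk -F'|' '
        /^[_ \t]*$/ { next }                # skip blank and underscore-only lines
        {
            out = ""
            for (i = 1; i <= NF; i++) {
                f = $i
                gsub(/_/, " ", f)           # underscores back to spaces
                gsub(/^ +| +$/, "", f)      # trim leading and trailing spaces
                out = out (i > 1 ? "," : "") f
            }
            print out
        }'
}

# Made-up input line, only to show the call:
printf 'Name_|_John_Smith\n' | split_fields
```

Splitting on "_" instead would break every value that contains a space, which is probably why -F'_' did not behave.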
09-16-2013 01:24 PM - edited 09-16-2013 01:28 PM
Re: Extract Data from Flat Text
In Perl I'd probably cycle through each line, checking for pipe symbols (if ($line =~ /\|/)), then doing a split into an array based on the '|' character (@data = split(/\|/, $line);). The pipe has to be escaped because split takes a pattern, and a bare '|' would be read as alternation. The individual array elements with data can be globally converted from underscores to spaces ($data[$n] =~ s/_/ /g;), as necessary. You'll then have to put in the logic for merging multiple lines together where appropriate.
Hope this made some sense...
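The same steps, including the merge of continuation lines, can be sketched in awk rather than Perl. Everything about the layout here is an assumption, because the real data was never posted: one file per record, each data line shaped like label|value, continuation lines carrying no pipe, and blank or underscore-only lines ignored. `records_to_csv` is a hypothetical name.

```shell
# records_to_csv (hypothetical name): one input file per record,
# one quoted CSV row per file on stdout.
records_to_csv() {
    awk -F'|' '
        # underscores back to spaces, then trim surrounding blanks
        function clean(s) {
            gsub(/_/, " ", s)
            gsub(/^[ \t]+|[ \t]+$/, "", s)
            return s
        }
        # CSV quoting: double any embedded quote, wrap in quotes
        function quote(s) {
            gsub(/"/, "\"\"", s)
            return "\"" s "\""
        }
        # print the collected values as one row and reset the counter
        function emit_row(   i, row) {
            if (n == 0) return
            row = ""
            for (i = 1; i <= n; i++) row = row (i > 1 ? "," : "") quote(val[i])
            print row
            n = 0
        }
        FNR == 1    { emit_row() }                      # new file starts a new record
        /^[_ \t]*$/ { next }                            # skip blank and underscore-only lines
        /\|/        { val[++n] = clean($2); next }      # label|value line: keep the value
        n > 0       { val[n] = val[n] " " clean($0) }   # continuation of the previous value
        END         { emit_row() }
    ' "$@"
}
```

Called as `records_to_csv page1.txt page2.txt > records.csv` (file names made up), this gives one row per page. If a line really carries three fields, as the first line appeared to, the `$2` rule would need adjusting to keep the extra field.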
09-16-2013 01:27 PM
Re: Extract Data from Flat Text
Thanks all.
I think I have it figured out now.
Hope all you guys in Colorado are OK!!