<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Extract Data from Flat Text in Operating System - HP-UX</title>
    <link>https://community.hpe.com/t5/operating-system-hp-ux/extract-data-from-flat-text/m-p/6199277#M496178</link>
    <description>&lt;P&gt;There are a *lot* of questions to answer about the data.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is the data actually plain ASCII or is this an HTML snapshot of the web page?&lt;/P&gt;&lt;P&gt;You say 'record'...is this one text file per screen scrape or are the scrapes hooked together in a long file?&lt;/P&gt;&lt;P&gt;If part of a large file, how do you determine where each screen starts and stops?&lt;/P&gt;&lt;P&gt;Are there tabs occupying the white space or just spaces?&lt;/P&gt;&lt;P&gt;Does the text have special characters embedded that are not visible?&lt;/P&gt;</description>
    <pubDate>Tue, 10 Sep 2013 17:34:36 GMT</pubDate>
    <dc:creator>Bill Hassell</dc:creator>
    <dc:date>2013-09-10T17:34:36Z</dc:date>
    <item>
      <title>Extract Data from Flat Text</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/extract-data-from-flat-text/m-p/6196093#M496176</link>
      <description>&lt;P&gt;It has been a while since I've been on the ITRC. I am posing a question to the script gurus.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have been given some data, originally screen scrapes from a client's web location.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;There is ONE page per record, and each record is formatted like this:&lt;BR /&gt;&lt;BR /&gt;&lt;IMG src="https://community.hpe.com/t5/image/serverpage/image-id/28231i1F94ED294869C962/image-size/original?v=mpbl-1&amp;amp;px=-1" alt="Capture.PNG" border="0" align="middle" title="Capture.PNG" /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I would like to pull out the "FIELDS" into a CSV, so I can populate a DB table with them.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Using awk or Perl, I am looking for suggestions on how to process each line as a "field" and each document as a record.&lt;/P&gt;&lt;P&gt;Suggestions on extraction and conversion are deeply appreciated.&lt;/P&gt;</description>
      <pubDate>Fri, 06 Sep 2013 16:11:43 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/extract-data-from-flat-text/m-p/6196093#M496176</guid>
      <dc:creator>rmueller58</dc:creator>
      <dc:date>2013-09-06T16:11:43Z</dc:date>
    </item>
    <item>
      <title>Re: Extract data from flat text</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/extract-data-from-flat-text/m-p/6196417#M496177</link>
      <description>&lt;P&gt;&amp;gt;how to process each line as a "field" and each document as a record.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The first line seems to have 3 fields, is that a special case?&lt;/P&gt;&lt;P&gt;Also, are those underscores actually there or are they spaces?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Otherwise, each field seems to be separated by "|".&amp;nbsp;&amp;nbsp; And I suppose the continuation lines have leading spaces.&lt;/P&gt;&lt;P&gt;And blank lines and lines with only underscores should be ignored.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Do you have some examples of the data, instead of a picture?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;(A picture may be worth 1000 words, but in this case it's worthless for scripting since I can't grep it.&lt;/P&gt;&lt;P&gt;Unless OCR works.&amp;nbsp; ;-)&lt;/P&gt;</description>
      <pubDate>Sat, 07 Sep 2013 07:49:05 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/extract-data-from-flat-text/m-p/6196417#M496177</guid>
      <dc:creator>Dennis Handly</dc:creator>
      <dc:date>2013-09-07T07:49:05Z</dc:date>
    </item>
    <item>
      <title>Re: Extract Data from Flat Text</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/extract-data-from-flat-text/m-p/6199277#M496178</link>
      <description>&lt;P&gt;There are a *lot* of questions to answer about the data.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is the data actually plain ASCII or is this an HTML snapshot of the web page?&lt;/P&gt;&lt;P&gt;You say 'record'...is this one text file per screen scrape or are the scrapes hooked together in a long file?&lt;/P&gt;&lt;P&gt;If part of a large file, how do you determine where each screen starts and stops?&lt;/P&gt;&lt;P&gt;Are there tabs occupying the white space or just spaces?&lt;/P&gt;&lt;P&gt;Does the text have special characters embedded that are not visible?&lt;/P&gt;</description>
      <pubDate>Tue, 10 Sep 2013 17:34:36 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/extract-data-from-flat-text/m-p/6199277#M496178</guid>
      <dc:creator>Bill Hassell</dc:creator>
      <dc:date>2013-09-10T17:34:36Z</dc:date>
    </item>
    <item>
      <title>Re: Extract Data from Flat Text</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/extract-data-from-flat-text/m-p/6200679#M496179</link>
      <description>&lt;P&gt;Bill and Dennis,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I actually ran HTML through "html2text", then wrote a scriptlet to strip out some other stuff to reduce it to the core.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The original piece was actually HTML.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What is seen is ASCII text.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Dennis, the '_' is actually an underscore. I have tried to field strip with awk -F'_' but it doesn't handle it correctly.&lt;/P&gt;&lt;P&gt;I think this is how html2text renders spaces.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I put it in a hex editor to see what character it was; it is actually \x5F (underscore).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In the HTML code the character is in fact \x20 (space).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;(see attached html)&lt;/P&gt;</description>
      <pubDate>Wed, 11 Sep 2013 14:08:08 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/extract-data-from-flat-text/m-p/6200679#M496179</guid>
      <dc:creator>rmueller58</dc:creator>
      <dc:date>2013-09-11T14:08:08Z</dc:date>
    </item>
    <item>
      <title>Re: Extract data from flat text</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/extract-data-from-flat-text/m-p/6201213#M496181</link>
      <description>&lt;P&gt;&amp;gt;I actually ran HTML through "html2text"&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So that's what you want to process?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;gt;I have tried to field strip with awk -F'_' but it doesn't handle it correctly.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Your field separator appears to be "|", -F"|".&amp;nbsp; And use gsub to convert "_" back to space.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;gt; (see attached html)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;(That's missing.&amp;nbsp; You must have a suffix like .txt.)&lt;/P&gt;&lt;P&gt;Can you also provide some example html2text output so we can see what needs to be processed?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
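      <content:encoded><![CDATA[
A minimal awk sketch of the suggestion above: split each line on "|" and use gsub to turn "_" back into a space. The real html2text output was never posted in this thread, so the input layout is an assumption, as are the choice to quote every CSV field and the helper name line_to_csv (hypothetical, for illustration only).

```shell
#!/bin/sh
# Sketch only: the layout of the html2text output is assumed, not confirmed.
# line_to_csv is a hypothetical helper name.
line_to_csv() {
    awk -F'|' '
    /^[_ \t]*$/ { next }                     # skip blank and underscore-only lines
    {
        out = ""
        for (i = 1; i <= NF; i++) {
            f = $i
            gsub(/_/, " ", f)                # underscores back to spaces
            gsub(/^[ \t]+|[ \t]+$/, "", f)   # trim leading and trailing blanks
            gsub(/"/, "\"\"", f)             # double embedded quotes for CSV
            out = out (i > 1 ? "," : "") "\"" f "\""
        }
        print out                            # one CSV line per input line
    }' "$@"
}
```

Possible usage, with a hypothetical file name: line_to_csv scrape.txt > fields.csv. Continuation lines are not merged here; each surviving input line becomes its own CSV line.
]]></content:encoded>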
      <pubDate>Wed, 11 Sep 2013 19:09:53 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/extract-data-from-flat-text/m-p/6201213#M496181</guid>
      <dc:creator>Dennis Handly</dc:creator>
      <dc:date>2013-09-11T19:09:53Z</dc:date>
    </item>
    <item>
      <title>Re: Extract Data from Flat Text</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/extract-data-from-flat-text/m-p/6206253#M496182</link>
      <description>&lt;P&gt;In Perl I'd probably cycle through each line, checking for pipe symbols (if ($line =~ /\|/)), then doing a split into an array on the '|' character (@data = split(/\|/, $line);). The pipe has to be escaped because split treats its first argument as a pattern. The individual array elements with data can be globally converted from underscores to spaces ($data[$n] =~ s/_/ /g;), as necessary. You'll then have to put in the logic for merging multiple lines together where appropriate.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hope this made some sense...&lt;/P&gt;</description>
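      <content:encoded><![CDATA[
The steps described above (find the lines that contain a pipe, split, convert underscores, merge continuation lines) can also be sketched in awk, emitting one CSV row per input file. This is an illustration under assumptions, not a tested solution for the real data: it assumes each field line looks like "label | value", that a non-blank line without a pipe continues the previous value, and that only the text after the first pipe is wanted. The helper name doc_to_csv_row is hypothetical.

```shell
#!/bin/sh
# Sketch only: the record layout is assumed. doc_to_csv_row is a hypothetical name.
doc_to_csv_row() {
    awk '
    function clean(s) {
        gsub(/_/, " ", s)                  # underscores back to spaces
        gsub(/^[ \t]+|[ \t]+$/, "", s)     # trim leading and trailing blanks
        gsub(/"/, "\"\"", s)               # double embedded quotes for CSV
        return s
    }
    function emit(   i, row) {             # print collected values as one CSV row
        if (n == 0) return
        row = ""
        for (i = 1; i <= n; i++)
            row = row (i > 1 ? "," : "") "\"" val[i] "\""
        print row
        n = 0
    }
    FNR == 1 { emit() }                    # a new input file starts a new record
    /^[_ \t]*$/ { next }                   # ignore blank and underscore-only lines
    index($0, "|") {                       # a line with a pipe starts a new field
        n++
        val[n] = clean(substr($0, index($0, "|") + 1))
        next
    }
    n > 0 {                                # no pipe: continuation of previous field
        val[n] = val[n] " " clean($0)
    }
    END { emit() }
    ' "$@"
}
```

Possible usage, with hypothetical file names: doc_to_csv_row page1.txt page2.txt > records.csv. If the first line really carries three fields, as noted earlier in the thread, it would need its own handling, since this sketch keeps everything after the first pipe as a single value.
]]></content:encoded>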
      <pubDate>Mon, 16 Sep 2013 20:28:41 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/extract-data-from-flat-text/m-p/6206253#M496182</guid>
      <dc:creator>RJHall</dc:creator>
      <dc:date>2013-09-16T20:28:41Z</dc:date>
    </item>
    <item>
      <title>Re: Extract Data from Flat Text</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/extract-data-from-flat-text/m-p/6206255#M496183</link>
      <description>&lt;P&gt;Thanks all.&lt;BR /&gt;&lt;BR /&gt;I think I have it figured out now.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Hope all you guys in Colorado are OK!!&lt;/P&gt;</description>
      <pubDate>Mon, 16 Sep 2013 20:27:25 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/extract-data-from-flat-text/m-p/6206255#M496183</guid>
      <dc:creator>rmueller58</dc:creator>
      <dc:date>2013-09-16T20:27:25Z</dc:date>
    </item>
  </channel>
</rss>

