<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Extract text from a PDF in Operating System - HP-UX</title>
    <link>https://community.hpe.com/t5/operating-system-hp-ux/extract-text-from-a-pdf/m-p/4377258#M664571</link>
    <description>&amp;gt;What have other users found is the best way of extracting text from PDF's?&lt;BR /&gt;&lt;BR /&gt;Cut &amp;amp; paste from Adobe Reader?</description>
    <pubDate>Thu, 12 Mar 2009 00:33:34 GMT</pubDate>
    <dc:creator>Dennis Handly</dc:creator>
    <dc:date>2009-03-12T00:33:34Z</dc:date>
    <item>
      <title>Extract text from a PDF</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/extract-text-from-a-pdf/m-p/4377257#M664570</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;Does anyone know of a reliable way of extracting text from a PDF?&lt;BR /&gt;&lt;BR /&gt;We currently use a commercial program, "PDF Plain Text Extractor", to extract text from PDF files supplied by our customers, but we find that some files lose pages, and others just give up with no data being extracted. We tried several commercial PC-based PDF extractors, but none appear to be able to cope with all the complexities of the PDF format. Adobe's own products are just as unsuccessful, though this could mean that the PDF's are not 100% compliant with the published standard.&lt;BR /&gt;&lt;BR /&gt;I have tested out the various Perl modules from CPAN, in an attempt to replace the PC-based system (I wrote it, but under duress) with something a little more robust (Perl on UNIX, rather than VBA on Windows). Unfortunately, the Perl modules have similar problems with many of the PDF formats, so things have not progressed very far.&lt;BR /&gt;&lt;BR /&gt;What have other users found is the best way of extracting text from PDF's? Hopefully, these are better than printing, and manually typing in the data.&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;&lt;BR /&gt;George</description>
      <pubDate>Wed, 11 Mar 2009 23:17:15 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/extract-text-from-a-pdf/m-p/4377257#M664570</guid>
      <dc:creator>George Spencer_4</dc:creator>
      <dc:date>2009-03-11T23:17:15Z</dc:date>
    </item>
    <item>
      <title>Re: Extract text from a PDF</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/extract-text-from-a-pdf/m-p/4377258#M664571</link>
      <description>&amp;gt;What have other users found is the best way of extracting text from PDF's?&lt;BR /&gt;&lt;BR /&gt;Cut &amp;amp; paste from Adobe Reader?</description>
      <pubDate>Thu, 12 Mar 2009 00:33:34 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/extract-text-from-a-pdf/m-p/4377258#M664571</guid>
      <dc:creator>Dennis Handly</dc:creator>
      <dc:date>2009-03-12T00:33:34Z</dc:date>
    </item>
    <item>
      <title>Re: Extract text from a PDF</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/extract-text-from-a-pdf/m-p/4377259#M664572</link>
      <description>I would use the PDF to Word tool to extract text.</description>
      <pubDate>Thu, 12 Mar 2009 09:22:47 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/extract-text-from-a-pdf/m-p/4377259#M664572</guid>
      <dc:creator>T G Manikandan</dc:creator>
      <dc:date>2009-03-12T09:22:47Z</dc:date>
    </item>
    <item>
      <title>Re: Extract text from a PDF</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/extract-text-from-a-pdf/m-p/4377260#M664573</link>
      <description>Check &lt;A href="http://www.softpedia.com/get/Office-tools/PDF/Easy-PDF-Text-Converter.shtml" target="_blank"&gt;http://www.softpedia.com/get/Office-tools/PDF/Easy-PDF-Text-Converter.shtml&lt;/A&gt;</description>
      <pubDate>Thu, 12 Mar 2009 09:24:35 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/extract-text-from-a-pdf/m-p/4377260#M664573</guid>
      <dc:creator>T G Manikandan</dc:creator>
      <dc:date>2009-03-12T09:24:35Z</dc:date>
    </item>
    <item>
      <title>Re: Extract text from a PDF</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/extract-text-from-a-pdf/m-p/4377261#M664574</link>
      <description>In the vein of "Don't raise the bridge, lower the river" I will ask, are these arbitrary PDF files, or are they ones "you" generate?  Can the generation process be told to generate text?  For example, I try (not always successfully :) to write the manual for netperf with texinfo, which can then be asked to generate all manner of output formats.</description>
      <pubDate>Thu, 12 Mar 2009 23:18:13 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/extract-text-from-a-pdf/m-p/4377261#M664574</guid>
      <dc:creator>rick jones</dc:creator>
      <dc:date>2009-03-12T23:18:13Z</dc:date>
    </item>
    <item>
      <title>Re: Extract text from a PDF</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/extract-text-from-a-pdf/m-p/4377262#M664575</link>
      <description>Hi Rick,&lt;BR /&gt;&lt;BR /&gt;The PDF files are just one of the formats that we are currently processing from over 80 suppliers. Many of the files sent to us (by e-mail) are the output from the suppliers' invoicing packages, and the role of the PC system is to convert the files into the standard text format used by our mainframe. Currently, we are expected to process files in Word, Excel, HTML, plain-text, or PDF format. Where possible we attempt to get the suppliers to generate a text file; however, the majority of our suppliers are small businesses, and the invoices we receive are generated manually in Excel. It is the medium-sized small businesses that appear to be only capable of generating output in PDF format. Unfortunately, these businesses do not want to know about IT, and if the processing cannot be done at our end, then the invoice has to be printed and the data entered manually.&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;George</description>
      <pubDate>Fri, 13 Mar 2009 00:47:03 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/extract-text-from-a-pdf/m-p/4377262#M664575</guid>
      <dc:creator>George Spencer_4</dc:creator>
      <dc:date>2009-03-13T00:47:03Z</dc:date>
    </item>
    <item>
      <title>Re: Extract text from a PDF</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/extract-text-from-a-pdf/m-p/4377263#M664576</link>
      <description>This may not be very helpful, but it may explain why the problem is so difficult.&lt;BR /&gt;&lt;BR /&gt;The PDF format is derived from the PostScript language's imaging model, and therefore generates each page by a sequence of instructions to "move to a location on the page" and "draw some entity at the current location".&lt;BR /&gt;&lt;BR /&gt;Unfortunately, some PDF generators will display words letter-by-letter, which means that any attempt to generate plain text from the PDF effectively has to try to understand the page as a whole. Also, some PDFs are actually just page images (JPEG, GIF, etc) included in a PDF wrapper.&lt;BR /&gt;&lt;BR /&gt;In either case, this probably explains why the commercial extractors have so much trouble.</description>
      <pubDate>Fri, 13 Mar 2009 09:43:15 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/extract-text-from-a-pdf/m-p/4377263#M664576</guid>
      <dc:creator>Andrew C Fieldsend</dc:creator>
      <dc:date>2009-03-13T09:43:15Z</dc:date>
    </item>
    <item>
      <title>Re: Extract text from a PDF</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/extract-text-from-a-pdf/m-p/4377264#M664577</link>
      <description>Like Rick said, you really are barking up the&lt;BR /&gt;wrong tree.  Your suppliers certainly have the&lt;BR /&gt;data in a format that you want but are converting&lt;BR /&gt;it to PDF - a format that you don't want. A format&lt;BR /&gt;intended to faithfully represent *printed* output,&lt;BR /&gt;not one intended for data interchange. And it's&lt;BR /&gt;either because your suppliers' tools aren't geared&lt;BR /&gt;to do what you want or they don't know how/aren't&lt;BR /&gt;willing to treat you as a special case.&lt;BR /&gt;&lt;BR /&gt;&amp;gt;Adobe's own products are just as unsuccessful&lt;BR /&gt;&lt;BR /&gt;How so? What Adobe tools have you tried? Adobe's&lt;BR /&gt;Acrobat Reader has a Save As Text option and this&lt;BR /&gt;typically does an excellent job - or as good a job&lt;BR /&gt;as possible given that ASCII can't really properly&lt;BR /&gt;represent the original in general.&lt;BR /&gt;&lt;BR /&gt;Windows systems have a Generic/Text Only printer&lt;BR /&gt;which you can set up to print to a file. You can&lt;BR /&gt;print a Word doc to this printer and just get the&lt;BR /&gt;text for example. When I try this with PDF I get&lt;BR /&gt;complete rubbish.  I expected that to work and&lt;BR /&gt;I've seen others have that work so I'm not sure&lt;BR /&gt;what's up here.&lt;BR /&gt;&lt;BR /&gt;Lastly, you could print it and OCR it back to&lt;BR /&gt;text. Yuck. At that point you really want to lean&lt;BR /&gt;harder on your suppliers to get you something that&lt;BR /&gt;is not PDF....</description>
      <pubDate>Fri, 13 Mar 2009 14:57:58 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/extract-text-from-a-pdf/m-p/4377264#M664577</guid>
      <dc:creator>PeterWolfe</dc:creator>
      <dc:date>2009-03-13T14:57:58Z</dc:date>
    </item>
  </channel>
</rss>

