topic Re: Extract text from a PDF in Operating System - HP-UX

Extract text from a PDF

George Spencer_4 — Wed, 11 Mar 2009 23:17:15 GMT

Hi,

Does anyone know of a reliable way of extracting text from a PDF?

We currently use a commercial program, "PDF Plain Text Extractor", to extract text from PDF files supplied by our customers, but we find that some files lose pages, and others just give up with no data being extracted. We tried several commercial PC-based PDF extractors, but none appear to be able to cope with all the complexities of the PDF format. Adobe's own products are just as unsuccessful, though this could mean that the PDFs are not 100% compliant with the published standard.

I have tested out the various Perl modules from CPAN, in an attempt to replace the PC-based system (I wrote it, but under duress) with something a little more robust (Perl on UNIX, rather than VBA on Windows). Unfortunately, the Perl modules have similar problems with many of the PDF files, so things have not progressed very far.

What have other users found is the best way of extracting text from PDFs? Hopefully, there is something better than printing the files and typing in the data manually.

Regards,

George

Re: Extract text from a PDF

Dennis Handly — Thu, 12 Mar 2009 00:33:34 GMT

>What have other users found is the best way of extracting text from PDFs?

Cut & paste from Adobe Reader?

Re: Extract text from a PDF

T G Manikandan — Thu, 12 Mar 2009 09:22:47 GMT

I would use the PDF to Word tool to extract text.

Re: Extract text from a PDF

T G Manikandan — Thu, 12 Mar 2009 09:24:35 GMT

Check http://www.softpedia.com/get/Office-tools/PDF/Easy-PDF-Text-Converter.shtml

Re: Extract text from a PDF

rick jones — Thu, 12 Mar 2009 23:18:13 GMT

In the vein of "Don't raise the bridge, lower the river" I will ask: are these arbitrary PDF files, or are they ones "you" generate? Can the generation process be told to generate text? For example, I try (not always successfully :) to write the manual for netperf with texinfo, which can then be asked to generate all manner of output formats.

Re: Extract text from a PDF

George Spencer_4 — Fri, 13 Mar 2009 00:47:03 GMT

Hi Rick,

The PDF files are just one of the formats that we are currently processing from over 80 suppliers. Many of the files sent to us (by e-mail) are the output from the suppliers' invoicing packages, and the role of the PC system is to convert the files into the standard text format used by our mainframe. Currently, we are expected to process files in Word, Excel, HTML, plain-text, or PDF format. Where possible we attempt to get the suppliers to generate a text file; however, the majority of our suppliers are small businesses, and the invoices we receive are generated manually in Excel. It is the medium-sized small businesses that appear to be capable of generating output only in PDF format. Unfortunately, these businesses do not want to know about IT, and if the processing cannot be done at our end, then the invoice has to be printed and the data entered manually.

Regards,
George

Re: Extract text from a PDF

Andrew C Fieldsend — Fri, 13 Mar 2009 09:43:15 GMT

This may not be very helpful, but it may explain why the problem is so difficult.

The PDF format is derived from the PostScript language, and therefore describes each page as a sequence of instructions to "move to a location on the page" and "draw some entity at the current location".

Unfortunately, some PDF generators place words on the page letter by letter, which means that any attempt to generate plain text from the PDF effectively has to try to understand the page as a whole. Also, some PDFs are actually just page images (JPEG, GIF, etc.) included in a PDF wrapper.

In either case, this probably explains why the commercial extractors have so much trouble.
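
To make the letter-by-letter case concrete, here is a minimal Python sketch of the kind of reconstruction an extractor has to do. It assumes the glyphs have already been decoded into hypothetical (x, y, width, char) tuples, with y increasing up the page; the function name, the tuple layout, and the two tolerance values are all illustrative assumptions, not any real extractor's API. It groups glyphs into lines by baseline, sorts each line left to right, and guesses at spaces from horizontal gaps.

```python
def glyphs_to_text(glyphs, line_tol=2.0, space_gap=3.0):
    """Rebuild text lines from individually placed glyphs.

    glyphs: iterable of (x, y, width, char) tuples (hypothetical layout),
            in whatever order the generator happened to draw them.
    line_tol: glyphs whose baselines differ by at most this much are
              treated as being on the same line (assumed value).
    space_gap: a horizontal gap wider than this between two glyphs is
               treated as a word space (assumed value).
    """
    # Pass 1: sort top of page first, then bucket glyphs into lines.
    lines = []  # list of (baseline_y, [glyph, ...])
    for g in sorted(glyphs, key=lambda g: -g[1]):
        if lines and abs(lines[-1][0] - g[1]) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append((g[1], [g]))

    # Pass 2: within each line, order by x and infer spaces from gaps.
    result = []
    for _, row in lines:
        row.sort(key=lambda g: g[0])
        chars = []
        prev_end = None
        for x, _, width, ch in row:
            if prev_end is not None and x - prev_end > space_gap:
                chars.append(" ")
            chars.append(ch)
            prev_end = x + width
        result.append("".join(chars))
    return result


if __name__ == "__main__":
    # Glyphs deliberately listed out of reading order.
    page = [
        (16, 700, 6, "o"), (10, 700, 6, "N"),
        (10, 714, 6, "I"), (16, 714, 6, "N"), (22, 714, 6, "V"),
        (30, 700, 6, "4"), (36, 700, 6, "2"),
    ]
    print(glyphs_to_text(page))
```

For the sample page this prints ['INV', 'No 42']. Even this toy version depends on two guessed thresholds, and it would get multi-column layouts, rotated text, and tables wrong, which gives some idea of why real files defeat real tools.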

Re: Extract text from a PDF

PeterWolfe — Fri, 13 Mar 2009 14:57:58 GMT

Like Rick said, you really are barking up the
wrong tree. Your suppliers certainly have the
data in a format that you want but are converting
it to PDF - a format that you don't want. A format
intended to faithfully represent *printed* output,
not one intended for data interchange. And it's
either because your suppliers' tools aren't geared
to do what you want or they don't know how/aren't
willing to treat you as a special case.

>Adobe's own products are just as unsuccessful

How so? What Adobe tools have you tried? Adobe's
Acrobat Reader has a Save As Text option and this
typically does an excellent job - or as good a job
as possible given that ASCII can't really properly
represent the original in general.

Windows systems have a Generic/Text Only printer
which you can set up to print to a file. You can
print a Word doc to this printer and just get the
text for example. When I try this with PDF I get
complete rubbish. I expected that to work and
I've seen others have that work so I'm not sure
what's up here.

Lastly, you could print it and OCR it back to
text. Yuck. At that point you really want to lean
harder on your suppliers to get you something that
is not PDF....