03-11-2009 04:17 PM
Extract text from a PDF
Does anyone know of a reliable way of extracting text from a PDF?
We currently use a commercial program, "PDF Plain Text Extractor", to extract text from PDF files supplied by our customers, but find that some files lose pages and others give up with no data extracted at all. We tried several commercial PC-based PDF extractors, but none appears able to cope with all the complexities of the PDF format. Adobe's own products are just as unsuccessful, though this could mean that the PDFs are not 100% compliant with the published standard.
I have tested the various Perl modules from CPAN in an attempt to replace the PC-based system (I wrote it, but under duress) with something a little more robust (Perl on UNIX rather than VBA on Windows). Unfortunately, the Perl modules have similar problems with many of the PDF files, so things have not progressed very far.
What have other users found to be the best way of extracting text from PDFs? Hopefully, something better than printing the files and typing in the data manually.
Regards,
George
03-11-2009 05:33 PM
Re: Extract text from a PDF
Cut & paste from Adobe Reader?
03-12-2009 02:22 AM
Re: Extract text from a PDF
03-12-2009 02:24 AM
Re: Extract text from a PDF
03-12-2009 04:18 PM
Re: Extract text from a PDF
03-12-2009 05:47 PM
Re: Extract text from a PDF
PDF is just one of the formats that we currently process, from over 80 suppliers. Many of the files sent to us (by e-mail) are the output of the suppliers' invoicing packages, and the role of the PC system is to convert the files into the standard text format used by our mainframe. Currently, we are expected to process files in Word, Excel, HTML, plain-text, or PDF format. Where possible we try to get the suppliers to generate a text file; however, the majority of our suppliers are small businesses, and the invoices we receive from them are generated manually in Excel. It is the mid-sized ones among them that appear able to generate output only in PDF format. Unfortunately, these businesses do not want to know about IT, and if the processing cannot be done at our end, the invoice has to be printed and the data entered manually.
Regards,
George
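The workflow George describes (convert each incoming file to text where a converter exists, otherwise fall back to manual entry) could be sketched roughly as below. This is only an illustration, not his actual system: the names `route`, `CONVERTERS`, `html_to_text` and `plain_to_text` are hypothetical, UTF-8 input is assumed, and only the plain-text and HTML cases are filled in using the Python standard library. Word, Excel and PDF deliberately have no converter here, so they land in the manual queue.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextOnly(HTMLParser):
    """Collects character data and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Note: this also collects the contents of <script>/<style> elements.
        self.parts.append(data)


def html_to_text(raw: bytes) -> str:
    parser = _TextOnly()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    return "".join(parser.parts)


def plain_to_text(raw: bytes) -> str:
    return raw.decode("utf-8", errors="replace")


# Hypothetical routing table: file suffix -> converter function.
CONVERTERS = {
    ".txt": plain_to_text,
    ".htm": html_to_text,
    ".html": html_to_text,
}


def route(filename: str, raw: bytes):
    """Return ("converted", text), or ("manual", reason) for manual entry."""
    suffix = Path(filename).suffix.lower()
    converter = CONVERTERS.get(suffix)
    if converter is None:
        return ("manual", f"no converter for {suffix or 'unknown type'}")
    text = converter(raw)
    if not text.strip():
        # Covers the "gives up with no data extracted" case.
        return ("manual", "converter produced no text")
    return ("converted", text)
```

Treating empty output as a failure matters as much as the routing: an extractor that silently returns nothing is otherwise indistinguishable from a genuinely blank invoice.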
03-13-2009 02:43 AM
Re: Extract text from a PDF
The PDF format is derived from the PostScript page-description language, and therefore generates each page by a sequence of instructions to "move to a location on the page" and "draw some entity at the current location".
Unfortunately, some PDF generators draw words letter by letter, in no guaranteed order, which means that any attempt to generate plain text from the PDF effectively has to try to understand the page as a whole. Also, some PDFs are actually just page images (JPEG, GIF, etc.) included in a PDF wrapper.
In either case, this probably explains why the commercial extractors have so much trouble.
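To make the letter-by-letter problem concrete, here is a minimal sketch of what an extractor has to do once it has recovered individually positioned glyphs: group them into lines by baseline, sort each line left to right, and guess where the spaces go from the horizontal gaps. The `Glyph` record, the `glyphs_to_text` function and the tolerance values are assumptions made for illustration; real extractors must also deal with fonts, encodings, rotation and multi-column layouts, none of which is handled here.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    x: float      # left edge, in points
    y: float      # baseline, in points (PDF origin is bottom-left)
    width: float  # advance width, in points
    char: str


def glyphs_to_text(glyphs, line_tol=2.0, space_gap=1.5):
    """Rebuild plain text from glyphs that were drawn in arbitrary order."""
    lines = []  # each entry is [baseline_y, [glyphs on that line]]
    # Visit glyphs from the top of the page down (largest y first).
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0] - g.y) <= line_tol:
            lines[-1][1].append(g)   # same baseline, within tolerance
        else:
            lines.append([g.y, [g]])  # start a new text line

    out = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)   # left-to-right reading order
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            # A gap wider than space_gap is taken to be a word break.
            if cur.x - (prev.x + prev.width) > space_gap:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)


# Glyphs deliberately listed out of reading order.
page = [
    Glyph(30, 700, 6, "C"),
    Glyph(10, 686, 6, "D"),
    Glyph(16, 700, 6, "B"),
    Glyph(10, 700, 6, "A"),
]
print(glyphs_to_text(page))
# prints:
# AB C
# D
```

The two tolerances are exactly where such heuristics go wrong: too small a `space_gap` splits words apart, too large a one runs them together, and no single value suits every generator's output.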
03-13-2009 07:57 AM
Re: Extract text from a PDF
You are barking up the wrong tree. Your suppliers
certainly have the data in a format that you want
but are converting it to PDF - a format that you
don't want. A format intended to faithfully
represent *printed* output, not one intended for
data interchange. And it's either because your
suppliers' tools aren't geared to do what you want
or because they don't know how to, or aren't
willing to, treat you as a special case.
>Adobe's own products are just as unsuccessful
How so? What Adobe tools have you tried? Adobe's
Acrobat Reader has a Save As Text option and this
typically does an excellent job - or as good a job
as possible given that ASCII can't really properly
represent the original in general.
Windows systems have a Generic/Text Only printer
which you can set up to print to a file. You can
print a Word doc to this printer and just get the
text for example. When I try this with PDF I get
complete rubbish. I expected that to work and
I've seen others have that work so I'm not sure
what's up here.
Lastly, you could print it and OCR it back to
text. Yuck. At that point you really want to lean
harder on your suppliers to get you something that
is not PDF....