Speeding up Acrobat PDF document indexing

These documentation pages are no longer maintained. Please visit the new documentation site.

When you make binary files searchable in release 3.2, you may notice a severe performance degradation when indexing PDF files. With larger documents, it can also take a long time before your browser returns after you publish an object which has a searchable binary file attribute.

This is mainly due to the default use of pstotext, which is based on Ghostscript. For many PDF documents its output is mostly garbage, which is hard for eZ publish to index.

It is much better to use pdftotext from the xpdf project. However, to use it with eZ publish, you need to keep its command-line arguments in mind.

Normally pdftotext expects an input filename and an output filename. However, the eZ publish implementation expects the output on stdout, which pdftotext selects when the output filename is given as a single dash (-).

To make pdftotext work with eZ publish, you need to create a small script, say ezpdftotext, and put it in a place where the web server/PHP can find it.

The contents of this file should be:

#!/bin/sh
# ezpdftotext script
pdftotext "$1" -

(You may need to specify the full path to pdftotext in the last line of the shell script above.)

If you do not already have it, you can download xpdf, which includes pdftotext, from

http://www.foolabs.com/xpdf/download.html

Note: Red Hat 7.3 users will need to download the source and compile it, as the shipped version of pdftotext does not convert PDF files very well.
[Tested on screen-optimised PDF files produced by OpenOffice 1.1rc4]

Your binaryfile.ini.append in the settings/override directory should include the following statements:

[HandlerSettings]
#....
MetaDataExtractor[application/pdf]=pdf
#....

# The path to the text extraction tool to use to
# fetch the information in PDF files
[PDFHandlerSettings]
TextExtractionTool=ezpdftotext

Voilà, happy PDF indexing (after clearing the ini cache)!

Comments

Tips

Class: The default File class sets the "is searchable" attribute to false. I created a new class called PDF and set its "is searchable" attribute to true.

Paths: Be sure to use full pathnames in both binaryfile.ini.append.php and the ezpdftotext script.

PDFs: Run pdftotext from the command line to make sure the PDF file is indexable. Some PDFs are protected from copying, and xpdf respects that. If you need to index such PDFs, you may need a second, unprotected copy of the document to index, and then post the protected version.

Thanks to the other posters who made this an easy task. :)
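The command-line check suggested in the tips above could be scripted roughly as follows. This is only a sketch: the function name pdf_has_text and the EXTRACTOR variable are made up for illustration and are not part of eZ publish or xpdf. It assumes the extractor accepts an input file followed by "-" for stdout, as pdftotext does, and treats a PDF as indexable when at least one non-whitespace character comes back.

```shell
#!/bin/sh
# Hypothetical helper, not part of eZ publish or xpdf.
# EXTRACTOR is an illustrative variable: set it to the full path of
# pdftotext (or of the ezpdftotext wrapper) if it is not on the PATH.
EXTRACTOR=${EXTRACTOR:-pdftotext}

pdf_has_text() {
    # Extract to stdout ("-") and succeed only if the output contains
    # at least one character that is not whitespace.
    "$EXTRACTOR" "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Example use (the file name is a placeholder):
#   if pdf_has_text manual.pdf; then echo "indexable"; else echo "no text"; fi
```

A PDF that fails this check is unlikely to contribute anything to the search index, whichever extraction tool is configured.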

Error

There is an error in the doc (at least as applied to 3.8). It should read

[HandlerSettings]
MetaDataExtractor[application/pdf]=ezpdf

instead of

MetaDataExtractor[application/pdf]=pdf

However, this line is not actually necessary, because the setting is already included in the default configuration file.

Bold text and soft hyphens will likely disrupt indexing

I don't know the eZ publish indexer, but I doubt it will properly compensate for pdftotext's default treatment of bold text, which is to double-space it l i k e t h i s. To avoid this, add the command-line argument -raw to your ezpdftotext script.

Additionally, soft hyphens at line breaks will probably not index correctly. Using a short sed script we can strip all hyphens followed by newlines. The following script combines the two fixes:

#!/bin/sh
# ezpdftotext script
pdftotext -raw "$1" - | sed ':a; /-$/N; s/-\n//; ta'

Note that this will probably not distinguish between soft and hard hyphens, meaning that if the word "use-case" is split across two lines, it will be combined into "usecase".
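To see what the sed step does on its own, it can be tried on a couple of sample lines without involving pdftotext at all. The sketch below wraps the same sed program in a function; the name join_hyphenated is made up for illustration, and the one-line ":a; ...; ta" form is assumed to be run with GNU sed.

```shell
#!/bin/sh
# join_hyphenated is a hypothetical helper name; it reads text on stdin.
# The sed program is the one from the script above: when a line ends in
# "-", append the next line (N), delete the "-" plus newline, and loop
# (ta) in case the joined line ends in a hyphen again.
join_hyphenated() {
    sed ':a; /-$/N; s/-\n//; ta'
}

# A word broken across two lines is rejoined: prints "document indexing"
printf 'docu-\nment indexing\n' | join_hyphenated

# A genuine hyphen at a line end is lost too: prints "a typical usecase"
printf 'a typical use-\ncase\n' | join_hyphenated
```

Lines that do not end in a hyphen pass through unchanged.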

Still having trouble

Hi,

I followed the instructions above on a Linux system, and I am still having trouble uploading PDFs larger than about 1 MB.

Both Apache and PHP are set up to accept uploads of up to 25 MB.

I changed from pstotext to pdftotext, but after I "send for publishing" the article containing the PDF file, one of two things still happens: either the browser says "done" on the submitting screen itself (I never get to the next screen, and the file is not uploaded), or I get an "A database transaction in eZ publish failed...." message.

To see whether the problem was related to the use of either pstotext or pdftotext, I timed the conversion of one of the problematic PDF files into text; the results are shown below. The PDF file size is 1,291,379 bytes, which is not that big.

tmp# time pstotext senhora.pdf > senhora.txt
real 0m35.893s
user 0m33.070s
sys 0m0.920s

# time pdftotext senhora.pdf senhora.txt2
real 0m1.731s
user 0m1.450s
sys 0m0.040s

As one can see, pdftotext is a lot faster than pstotext: 1.7 seconds against 35.9. Still, even pstotext takes "only" 35.9 seconds to do its job. So I ask: is this part of the upload really what is causing the system to lock up? I don't think so.

Another clue: when I go to the class and turn off the "Searchable" flag for the file attribute, everything goes very smoothly... but then the file is not searchable!

And finally another hint is: Inspecting the database I see a very suspicious table size for table "ezsearch_object_word_link": 4,571,489 records / 817.5 MB. I guess this is the table that stores the indexed material, but at that size would it slow down indexing of new material?

I'd appreciate any help, as my system is heavily based on PDFs.

Roberto

Windows users

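Instead of creating ezpdftotext, you should create ezpdftotext.bat with the following content:

@echo off
rem Echo is turned off so that the command line itself does not end up in the extracted text.
pdftotext.exe %1 -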



Created

11/09/2003
9:47:30 pm
by Paul Borgermans

Last updated

26/09/2003
10:01:20 pm
by Marco Zinn

Authors

Paul Borgermans
Tony Wood
Marco Zinn



This page is part of the eZ Publish documentation. The documentation is available under the GNU Free Documentation License. All contributions will be released under the terms of this license.