Configuring binary file indexing

Note 1: Binary file indexing is available as of version 3.2.

Note 2: Before uploading any binary file, make sure that the class attribute that holds the file is marked as "searchable".

Note 3: As of 3.2 final, the filename itself is not indexed. For "file" objects, however, the search engine will still find it in most cases, because the object name is usually taken from the filename.

The indexing engine currently supports plain text, PDF and MS Word documents.

All settings for binary file indexing are found in the ini file binaryfile.ini.

Plain text

Plain text documents do not require any configuration: the plaintext handler is used for all files that have the MIME type text/plain.

The default setting for plain text is:

[HandlerSettings]

MetaDataExtractor[text/plain]=plaintext
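
To illustrate how the MetaDataExtractor entries map a MIME type to a handler, here is a minimal Python sketch. It is not Exponential's own code; it only parses a fragment written in the style of binaryfile.ini (the three default entries shown on this page) and looks up a handler name. The function names load_handlers and handler_for are hypothetical.

```python
import configparser
import re

# Fragment in the style of binaryfile.ini, built from the defaults on this page.
SAMPLE_INI = """
[HandlerSettings]
MetaDataExtractor[text/plain]=plaintext
MetaDataExtractor[application/pdf]=pdf
MetaDataExtractor[application/msword]=word
"""


def load_handlers(ini_text):
    """Return a dict mapping MIME type -> handler name (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(ini_text)
    handlers = {}
    for key, value in parser["HandlerSettings"].items():
        # Keys look like MetaDataExtractor[<mime type>]
        match = re.fullmatch(r"MetaDataExtractor\[(.+)\]", key)
        if match:
            handlers[match.group(1)] = value.strip()
    return handlers


def handler_for(mime_type, handlers):
    """Return the handler name, or None if no handler is configured."""
    return handlers.get(mime_type)


if __name__ == "__main__":
    table = load_handlers(SAMPLE_INI)
    print(handler_for("text/plain", table))
    print(handler_for("application/vnd.ms-excel", table))
```

A MIME type without an entry yields None here, which corresponds to a file type that has no handler configured.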

PDF

PDF files are handled by an external program that returns the content of the PDF file as plain text. This has been tested with the pstotext and pdftotext programs, but should work with others as well.
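
The external-program approach can be sketched in Python as follows. This is an illustration only, not how Exponential invokes the tool internally. It assumes the configured command prints the extracted text on standard output when the file path is appended as the last argument, which is how pstotext behaves by default. The function name extract_text and the timeout value are hypothetical.

```python
import shlex
import subprocess


def extract_text(tool_command, file_path, timeout=60):
    """Run an external converter and return what it prints on stdout.

    tool_command is a command line such as "pstotext"; the file path is
    appended as the final argument (an assumption of this sketch).
    """
    argv = shlex.split(tool_command) + [file_path]
    result = subprocess.run(
        argv,
        capture_output=True,  # collect stdout and stderr
        timeout=timeout,      # do not hang forever on a broken file
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    # Replace undecodable bytes instead of failing on odd encodings.
    return result.stdout.decode("utf-8", errors="replace")
```

Note that pdftotext writes to a text file by default and only prints to standard output when "-" is given as the output file name, so it would need a different argument order than this sketch assumes.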

pstotext

The pstotext program can be found either on freshmeat.net or the pstotext homepage.
http://freshmeat.net/projects/pstotext/?topic_id=849
http://research.compaq.com/SRC/virtualpaper/pstotext.html

The default settings for using pstotext are:

[HandlerSettings]

MetaDataExtractor[application/pdf]=pdf

[PDFHandlerSettings]

TextExtractionTool=pstotext

pdftotext

Another option for indexing PDF files is pdftotext from the xpdf project. Read more about that here.

MS Word

MS Word documents are also handled by an external program. This feature requires the wv program to work properly.

wv

The wv (word view) program can be found either on freshmeat.net or on the wv homepage.
http://freshmeat.net/projects/wv/?topic_id=849
http://wvWare.sourceforge.net/

The default settings for using wv are:

[HandlerSettings]

MetaDataExtractor[application/msword]=word

[WordHandlerSettings]

TextExtractionTool=wvWare -x /usr/local/wv/wvText.xml

Alternatives:

You may consider these alternative indexers:

For MS Word: antiword (http://www.winfield.demon.nl/ or http://www.antiword.org)
For MS Excel: XLhtml (http://chicago.sourceforge.net/xlhtml/)
For MS PowerPoint: XLhtml (http://chicago.sourceforge.net/xlhtml/), which also includes a PPT-to-HTML converter

For XLhtml, you will need an HTML-to-text converter. You can use a text-mode web browser such as lynx or w3m for this.
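
The two-step conversion (document to HTML, then HTML to text) can be sketched in Python as a pipe between two commands. This is an untested illustration, not a supported setup. The function name convert_via_html is hypothetical, and the command lines in the comment are assumptions that should be checked against the installed versions of the tools.

```python
import subprocess


def convert_via_html(to_html_argv, to_text_argv, timeout=60):
    """Pipe the stdout of an HTML converter into an HTML-to-text tool."""
    # Step 1: convert the document to HTML (captured as bytes).
    html = subprocess.run(
        to_html_argv, capture_output=True, timeout=timeout, check=True
    ).stdout
    # Step 2: feed the HTML to the text converter on its standard input.
    text = subprocess.run(
        to_text_argv, input=html, capture_output=True, timeout=timeout, check=True
    ).stdout
    return text.decode("utf-8", errors="replace")


# Assumed usage (verify the flags locally before relying on them):
#   convert_via_html(["xlhtml", "sheet.xls"], ["lynx", "-dump", "-stdin"])
```

Keeping the two commands as separate argument lists makes it easy to swap lynx for w3m or another converter without touching the rest of the code.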

As soon as we have set this up completely and integrated everything with eZ 3.2, I will give more detailed info.

Issues:

For problems with large files, see http://ez.no/developer/exponential_3/bug_reports/weird_search_limitations_binary_file

Note that indexing relies on correct MIME types. As of 3.2-3, the MIME type for MS Excel files was not set.
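
A quick way to see which files would fall through because of their MIME type is sketched below in Python. It is only a stand-in: Python's mimetypes module guesses the type from the file extension, which is not necessarily how Exponential determines it. The set CONFIGURED lists the three types that have default handlers on this page, and the function name indexing_status is hypothetical.

```python
import mimetypes

# MIME types that have a default handler according to this page.
CONFIGURED = {"text/plain", "application/pdf", "application/msword"}


def indexing_status(filename, configured=CONFIGURED):
    """Return (guessed MIME type, whether a handler is configured for it)."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        # Unknown type: no handler can be selected.
        return (None, False)
    return (mime_type, mime_type in configured)


if __name__ == "__main__":
    for name in ("report.pdf", "notes.txt", "sheet.xls"):
        print(name, indexing_status(name))
```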

Comments

PowerPoint extraction

Brendan Pike

Thursday 29 April 2004 7:46:16 am

Hi, I'm very interested in getting Exponential to index PowerPoint. Did you succeed in this? If so, could you please add details of how to achieve this? Thanks :)

Created

26/08/2003
5:11:34 pm
by Jan Borsodi

Last updated

02/06/2004
11:10:56 am
by Terje Gunrell-Kaste

Authors

Jan Borsodi
Marco Zinn
Terje Gunrell-Kaste

This page is part of the Exponential documentation. The documentation is available under the GNU Free Documentation License. All contributions will be released under the terms of this license.
