Indexing binary files with IFilters on Windows installations

I posted earlier about using Microsoft's IFilters to index binary files on Windows installations of Exponential. After some rummaging about I got it to work - I've tried PPT, DOC, and PDF so far and it works well.

The idea is that instead of using pstotext or pdftotext for PDF files and wvware for Word files, we can use IFilters provided by Microsoft (doc, xls, ppt, vsd, etc.), Adobe (PDF), and many others (SXW, DWG, etc.). Microsoft has a utility called filtdump - it's a command line utility that uses IFilters to dump the text content of a file to stdout.
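To make the idea concrete, here is a minimal Python sketch of how a caller could run filtdump and collect the dumped text. It is not part of Exponential: the function names are made up for illustration, the `-b` flag is simply taken from the TextExtractionTool setting in these notes, and the output encoding is an assumption.

```python
import shlex
import subprocess


def build_command(tool, path):
    """Split a tool string such as "filtdump.exe -b" and append the file path."""
    # posix=False keeps Windows-style backslashes in paths intact.
    return shlex.split(tool, posix=False) + [path]


def extract_text(path, tool="filtdump.exe -b"):
    """Run the extraction tool and return whatever it wrote to stdout."""
    result = subprocess.run(build_command(tool, path),
                            capture_output=True, check=True)
    # Assumption: the output decodes as UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("report.doc")` only works if filtdump.exe is on the system path and an IFilter for that file type is installed.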

This hack does require a new plugin for the ezbinaryfile type.

Here are my rough notes:

Install Indexing Service - the service doesn't actually have to be running, but you'll need it installed.

Download the Microsoft Platform SDK. Copy filtdump.exe from the bin directory to a conveniently located folder on the system path.

Override binaryfile.ini as follows:

[HandlerSettings]
MetaDataExtractor[application/msword]=ifilter
MetaDataExtractor[application/vnd.ms-excel]=ifilter
MetaDataExtractor[application/pdf]=ifilter
MetaDataExtractor[application/vnd.ms-powerpoint]=ifilter
MetaDataExtractor[application/vnd.visio]=ifilter

[IFilterHandlerSettings]
TextExtractionTool=filtdump.exe -b

Find the directory exponential/kernel/classes/datatypes/ezbinaryfile/plugins. Copy ezwordparser.php and rename the copy ezifilterparser.php. Replace all instances of "word" with "ifilter".
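The copy-and-rename step can also be scripted. The Python sketch below is one hedged way to do it. It assumes that "word" appears in the plugin in more than one capitalisation, and that the capitalised form should become "IFilter" to match the [IFilterHandlerSettings] section name used above. The helper names are hypothetical, and the result should still be reviewed by hand.

```python
import re
from pathlib import Path


def word_to_ifilter(source):
    """Replace every "word" with "ifilter", following the original capitalisation."""
    def swap(match):
        text = match.group(0)
        if text.isupper():
            return "IFILTER"
        if text[0].isupper():
            return "IFilter"  # assumption: matches [IFilterHandlerSettings]
        return "ifilter"
    return re.sub(r"word", swap, source, flags=re.IGNORECASE)


def make_ifilter_plugin(plugin_dir):
    """Write ezifilterparser.php next to ezwordparser.php and return its path."""
    src = Path(plugin_dir) / "ezwordparser.php"
    dst = Path(plugin_dir) / "ezifilterparser.php"
    # latin-1 maps every byte to a character, so the file content round-trips.
    dst.write_text(word_to_ifilter(src.read_text(encoding="latin-1")),
                   encoding="latin-1")
    return dst
```

Note that this replaces every occurrence, including inside longer words, which is what "replace all instances" implies; check the generated file before using it.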

Change the File class so that the file attribute is searchable.

Increase the maximum packet size (max_allowed_packet) in the MySQL configuration so that large queries are accepted:

[mysqld]
set-variable = max_allowed_packet=16M

Restart MySQL.

Consider turning on delayed indexing and indexing by cron job - it can take a long time to index files when uploading and it gets annoying.

Consider enabling wildcard searches.

Set the MIME type for Excel files in Exponential - this is described elsewhere on the ez.no site.

Clear all caches.

For manual indexing, copy php.exe from the cli directory of the current PHP distribution (zip), rename it phpcli.exe, and place it in the PHP directory used by Exponential. Then, from the Exponential directory, run:

..\php\phpcli -C update\common\scripts\updatesearchindex.php --clean


Created: 02/02/2005 10:01:55 pm by Jonathan Cutting

Last updated: 02/02/2005 10:07:06 pm by Jonathan Cutting

Authors: Jonathan Cutting



This page is part of the Exponential documentation. The documentation is available under the GNU Free Documentation License. All contributions will be released under the terms of this license.