Note 1: Binary file indexing is available as of version 3.2.
Note 2: Before uploading any binary file, make sure that the class attribute that holds the file is marked "searchable".
Note 3: As of 3.2 final, the filename itself is not indexed. For "file" objects, however, the search engine will still find it, because the object name is usually taken from the filename.
The indexing engine currently supports plain text, PDF and MS Word documents.
All settings for binary file indexing are found in the ini file binaryfile.ini.
Plain text documents do not require any configuration, as the plain text handler is used for all files that have the MIME type text/plain.
The default setting for plain text is
[HandlerSettings]
MetaDataExtractor[text/plain]=plaintext
PDF files are handled using external programs which return the content of the PDF file as plain text. This has been tested with the pstotext and pdftotext programs, but should work with others as well.
The pstotext program can be found either on freshmeat.net or the pstotext homepage.
http://freshmeat.net/projects/pstotext/?topic_id=849
http://research.compaq.com/SRC/virtualpaper/pstotext.html
The default settings for using pstotext are
[HandlerSettings]
MetaDataExtractor[application/pdf]=pdf
[PDFHandlerSettings]
TextExtractionTool=pstotext
Another option for indexing PDF files is pdftotext from the xpdf project. Read more about that here.
MS Word documents are also handled using an external program. This feature requires the wv program to work properly.
The wv (Word View) program can be found either on freshmeat.net or the wv homepage.
http://freshmeat.net/projects/wv/?topic_id=849
http://wvWare.sourceforge.net/
The default settings for using wv are
[HandlerSettings]
MetaDataExtractor[application/msword]=word
[WordHandlerSettings]
TextExtractionTool=wvWare -x /usr/local/wv/wvText.xml
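The PDF and Word handlers above both rely on an external program that prints the document text. As a rough illustration only (this is not the actual eZ publish handler code), the following Python sketch runs such a tool and captures its standard output. The function name extract_text and the extra_args parameter are made up for this example, and it assumes the tool writes the text to stdout when given the file path as an argument.

```python
import shlex
import subprocess


def extract_text(tool_command, file_path, extra_args=()):
    """Run an external text extraction tool and return what it prints.

    tool_command is a command line such as the TextExtractionTool value,
    e.g. "pstotext" or "wvWare -x /usr/local/wv/wvText.xml".
    Assumption: the tool prints the extracted text to standard output.
    """
    # Split the configured command into program and options, then append
    # the file to index and any tool-specific trailing arguments.
    argv = shlex.split(tool_command) + [file_path] + list(extra_args)
    # check=True raises CalledProcessError if the tool exits with an error.
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout
```

A call could look like extract_text("pstotext", "manual.pdf"). Note that pdftotext writes to a file by default and only prints to standard output when "-" is given as the output name, e.g. extract_text("pdftotext", "manual.pdf", ["-"]).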
You may consider these alternative indexers:
For XLHtml, you will need an HTML-to-text converter. You can use a text-mode web browser like lynx or w3m for this.
As soon as we have set this up completely (on eZ 3.2) and integrated everything with eZ 3.2, I will give more detailed info.
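To give an idea of what the HTML-to-text step does, here is a minimal Python sketch that strips the markup from HTML and keeps only the text. It is a stand-in using Python's standard html.parser module, not lynx or w3m, and the names TextCollector and html_to_text are invented for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-empty text; tags themselves never reach this method.
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text content of an HTML string, separated by spaces."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.chunks)
```

This simple version does not skip script or style content and does not try to preserve table layout, which is usually good enough when the result is only fed to a search index.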
For problems with large files, see http://ez.no/developer/exponential_3/bug_reports/weird_search_limitations_binary_file
Note that indexing relies on correct MIME types. As of 3.2-3, the MIME type for MS Excel files was not set.
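Why the MIME type matters can be shown with a small Python sketch that reads the MetaDataExtractor entries listed on this page and looks up a handler by MIME type. This only illustrates the lookup idea and is not how eZ publish itself parses binaryfile.ini; the function names load_handlers and handler_for are hypothetical.

```python
import configparser
import re

# The three default entries from this page, combined into one section.
SAMPLE_INI = """
[HandlerSettings]
MetaDataExtractor[text/plain]=plaintext
MetaDataExtractor[application/pdf]=pdf
MetaDataExtractor[application/msword]=word
"""


def load_handlers(ini_text):
    """Return a dict mapping MIME type to handler name."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the original case of the keys
    parser.read_string(ini_text)
    handlers = {}
    for key, value in parser["HandlerSettings"].items():
        # Keys look like MetaDataExtractor[<mime type>].
        match = re.fullmatch(r"MetaDataExtractor\[(.+)\]", key)
        if match:
            handlers[match.group(1)] = value
    return handlers


def handler_for(mime_type, handlers):
    """Return the handler name, or None if the MIME type is unknown."""
    return handlers.get(mime_type)
```

With these entries, handler_for("application/pdf", load_handlers(SAMPLE_INI)) gives "pdf", while a file with a missing or unlisted MIME type gets no handler at all and so would not be indexed.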
Comments
PowerPoint extraction
Brendan Pike
Thursday 29 April 2004 7:46:16 am