On Thu, Sep 8, 2016 at 1:22 AM Dr. Hirn drhirn@gmail.com wrote:
Hi Chad,
So Cirrus will index file contents for which we have a media handler defined. Right now, PDF and DjVu files have specific media handlers that can extract their text contents.
Do I have to configure something more? My uploaded PDFs don't get indexed.
The relevant lines in my LocalSettings.php:
wfLoadExtension( 'Elastica' );
require_once "$IP/extensions/CirrusSearch/CirrusSearch.php";
$wgCirrusSearchServers = array('xxx.xxx.xxx.xxx');
$wgSearchType = 'CirrusSearch';
Do you have the PdfHandler extension installed as well? If that's installed then this should Just Work without any additional configuration. Unless something has changed recently....
If you have an additional media type you want to extract text from, that's what would need implementing.
Any hints on that?
Sure. We've got a class in MediaWiki called ImageHandler. Media types that require special handling have a subclass of that. Here are the ones for PDF and DjVu, for example:
https://phabricator.wikimedia.org/diffusion/EPHD/browse/master/PdfHandler_bo...
https://phabricator.wikimedia.org/diffusion/MW/browse/master/includes/media/...
If you wanted to index, say, Word documents, you'd need a similar class in an extension to provide that support (there might be an extension for Word docs already, I dunno).
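A rough, untested sketch of what such a class might look like, to make the idea concrete. The names WordHandler and extractTextFromWord() are made up for illustration, the MIME type is only an example, and the method signatures assume a MediaWiki core from around this time (they may differ in other versions). It also assumes that the search indexer obtains file text through the handler's getEntireText() method, which is how the PDF and DjVu handlers expose theirs.

```php
<?php
// Hypothetical handler for Word documents. Not an existing extension.
class WordHandler extends ImageHandler {

	// This sketch does not render thumbnails, so it always reports an error
	// to callers that ask for one.
	public function doTransform( $image, $dstPath, $dstUrl, $params, $flags = 0 ) {
		return new MediaTransformError(
			'thumbnail_error', 0, 0, 'Thumbnails are not supported for this type'
		);
	}

	// Return the full text of the document, or false if none is available.
	// Assumption: this is the method the search indexer calls.
	public function getEntireText( File $file ) {
		$path = $file->getLocalRefPath();
		if ( $path === false ) {
			return false;
		}
		$text = $this->extractTextFromWord( $path );
		return $text === '' ? false : $text;
	}

	// Placeholder: a real implementation would call an external converter
	// or a parsing library here.
	private function extractTextFromWord( $path ) {
		return '';
	}
}

// Registration, e.g. in the extension's setup file: map the MIME type to
// the handler class so MediaWiki uses it for matching uploads.
$wgMediaHandlers['application/msword'] = 'WordHandler';
```

Files uploaded before the handler was registered would presumably need to be reindexed before their text shows up in search.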
-Chad