On Thu, Sep 8, 2016 at 1:22 AM Dr. Hirn <drhirn(a)gmail.com> wrote:
Hi Chad,
So Cirrus will index file contents for which we have a media handler
defined. Right now, Pdf and Djvu files have specific media handlers that
can extract their text contents.
Do I have to configure something more? My uploaded PDFs don't get indexed.
The relevant lines in my LocalSettings.php:
wfLoadExtension( 'Elastica' );
require_once "$IP/extensions/CirrusSearch/CirrusSearch.php";
$wgCirrusSearchServers = array('xxx.xxx.xxx.xxx');
$wgSearchType = 'CirrusSearch';
Do you have the PdfHandler extension installed as well? If that's installed
then this should Just Work without any additional configuration. Unless
something has changed recently....
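For reference, a minimal LocalSettings.php sketch of loading PdfHandler next to the CirrusSearch lines quoted above. It assumes the extension is unpacked under $IP/extensions/PdfHandler and that Ghostscript, ImageMagick and poppler-utils are installed. The binary paths are illustrative and will differ between systems, and older MediaWiki releases may need the require_once form instead of wfLoadExtension.

```php
// Load PdfHandler so that PDF uploads get a media handler able to extract text.
wfLoadExtension( 'PdfHandler' );

// External tools PdfHandler shells out to (paths are assumptions, adjust locally).
$wgPdfProcessor = '/usr/bin/gs';           // Ghostscript, renders pages
$wgPdfPostProcessor = '/usr/bin/convert';  // ImageMagick, scales the rendered pages
$wgPdfInfo = '/usr/bin/pdfinfo';           // reads page count and dimensions
$wgPdftoText = '/usr/bin/pdftotext';       // extracts the text that ends up in the index
```

Files uploaded before the handler was enabled may not have extracted text stored yet, so refreshing their metadata (maintenance/refreshImageMetadata.php) and rebuilding the index (CirrusSearch's maintenance/forceSearchIndex.php) may be needed afterwards.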
If you have an additional media type you want to extract text from,
that's what would need implementing.
Any hints on that?
Sure. We've got a class in MediaWiki called ImageHandler. Media types that
require special handling have a subclass of that. Here are the ones for PDF
and DjVu, for example:
https://phabricator.wikimedia.org/diffusion/EPHD/browse/master/PdfHandler_b…
https://phabricator.wikimedia.org/diffusion/MW/browse/master/includes/media…
If you wanted to index, say, Word documents, you'd need a similar class in
an extension to provide that support (there might be an extension for Word
docs already, I dunno).
-Chad
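
As a rough sketch of how such a class would be wired in: MediaWiki maps MIME types to handler classes through $wgMediaHandlers. The class name WordHandler and its file path below are hypothetical and used only for illustration; this does not imply that such an extension exists.

```php
// Hypothetical: tell the autoloader where the handler class lives.
$wgAutoloadClasses['WordHandler'] = "$IP/extensions/WordHandler/WordHandler.php";

// Map the legacy Word MIME type to the (hypothetical) handler class.
$wgMediaHandlers['application/msword'] = 'WordHandler';
```

For the text to reach the search index, the class would also have to return the document's text from the handler's text-extraction methods, in the same way the PDF and DjVu handlers linked above do.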