Hi,
I want uploaded files to be searchable. For that purpose I have installed:
- MediaWiki 1.27
- Elasticsearch 1.7.5
- Extension CirrusSearch 1.27
- Extension Elastica (master)
The search within wiki pages is working, but I cannot search for text within uploaded files. I have the feeling I'm missing something (e.g. Tika), but I have no idea what, or where to find more information.
Do you?
Thanks a lot!
Stefan
On Wed, Sep 7, 2016 at 4:06 AM Dr. Hirn drhirn@gmail.com wrote:
> Hi,
> I want uploaded files to be searchable. For that purpose I have installed:
> - MediaWiki 1.27
> - Elasticsearch 1.7.5
> - Extension CirrusSearch 1.27
> - Extension Elastica (master)
> The search within wiki-pages is working. But I can not search for text within uploaded files. Somehow I have the feeling, I'm missing something (e.g. Tika, ...) but I have no idea what and where to find more information.
> Do you?
So Cirrus will index the contents of files for which we have a media handler defined. Right now, PDF and DjVu files have specific media handlers that can extract their text contents.
If you have an additional media type you want to extract text from, that's what would need implementing.
-Chad
Hi Chad,
> So Cirrus will index file contents for which we have a media handler defined. Right now, Pdf and Djvu files have specific media handlers that can extract their text contents.
Do I have to configure anything else? My uploaded PDFs don't get indexed.
The relevant lines in my LocalSettings.php:
wfLoadExtension( 'Elastica' );
require_once "$IP/extensions/CirrusSearch/CirrusSearch.php";
$wgCirrusSearchServers = array('xxx.xxx.xxx.xxx');
$wgSearchType = 'CirrusSearch';
> If you have an additional media type you want to extract text from, that's what would need implementing.
Any hints on that?
Thx
Stefan
On Thu, Sep 8, 2016 at 1:22 AM Dr. Hirn drhirn@gmail.com wrote:
> Hi Chad,
> > So Cirrus will index file contents for which we have a media handler defined. Right now, Pdf and Djvu files have specific media handlers that can extract their text contents.
> Do I have to configure something more? My uploaded pdf don't get indexed.
> The relevant lines in my LocalSettings.php:
> wfLoadExtension( 'Elastica' );
> require_once "$IP/extensions/CirrusSearch/CirrusSearch.php";
> $wgCirrusSearchServers = array('xxx.xxx.xxx.xxx');
> $wgSearchType = 'CirrusSearch';
Do you have the PdfHandler extension installed as well? If that's installed then this should Just Work without any additional configuration. Unless something has changed recently....
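For reference, a minimal LocalSettings.php sketch for loading PdfHandler might look like the following. The binary paths and the loading style are assumptions for a typical Linux install, not taken from this thread; adjust them to your system. One thing worth checking is that pdftotext is actually installed and reachable, since without it there is likely no text layer to hand to the search index.

```php
<?php
// LocalSettings.php fragment (sketch only; paths below are assumptions).

// Load PdfHandler via its PHP entry point. Versions of the extension that
// ship an extension.json can use wfLoadExtension( 'PdfHandler' ) instead.
require_once "$IP/extensions/PdfHandler/PdfHandler.php";

// Allow PDF uploads.
$wgFileExtensions[] = 'pdf';

// External tools PdfHandler shells out to.
$wgPdfProcessor = '/usr/bin/gs';       // Ghostscript, renders page images
$wgPdfInfo = '/usr/bin/pdfinfo';       // page count and dimensions
$wgPdftoText = '/usr/bin/pdftotext';   // extracts the text that gets indexed
```

PDFs uploaded before these tools were available may need their file metadata refreshed and the search index rebuilt (for example with CirrusSearch's forceSearchIndex.php maintenance script) before their text shows up in results.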
> > If you have an additional media type you want to extract text from, that's what would need implementing.
> Any hints on that?
Sure. We've got a class in MediaWiki called ImageHandler. Media types that require special handling have a subclass of that. Here are the ones for PDF and DjVu, for example:
https://phabricator.wikimedia.org/diffusion/EPHD/browse/master/PdfHandler_bo...
https://phabricator.wikimedia.org/diffusion/MW/browse/master/includes/media/...
If you wanted to index, say, Word documents, you'd need a similar class in an extension to provide that support (there might be an extension for word docs already, I dunno).
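As a rough illustration of the text-extraction part of such a class, here is a sketch for .docx files. The function name docxToPlainText() and the idea of a DocxHandler class are hypothetical, not part of MediaWiki or any existing extension. It assumes PHP's zip extension is available and relies on the fact that a .docx file is a ZIP archive whose main body is stored in word/document.xml.

```php
<?php
// Hypothetical helper: not part of MediaWiki or any existing extension.

/**
 * Extract plain text from a .docx file.
 *
 * @param string $path Local filesystem path to the .docx file
 * @return string|false Extracted text, or false if the file cannot be read
 */
function docxToPlainText( $path ) {
	$zip = new ZipArchive();
	if ( $zip->open( $path ) !== true ) {
		return false;
	}
	$xml = $zip->getFromName( 'word/document.xml' );
	$zip->close();
	if ( $xml === false ) {
		return false;
	}
	// End each paragraph with a newline, then drop the remaining markup.
	$xml = str_replace( '</w:p>', "\n", $xml );
	$text = html_entity_decode( strip_tags( $xml ), ENT_QUOTES | ENT_XML1, 'UTF-8' );
	return trim( $text );
}

// In an extension, a handler subclass (say, a hypothetical DocxHandler) would
// return this text from its text-extraction method, passing in the local path
// of the uploaded file, in the same way the PDF and DjVu handlers return
// their extracted text.
```

The handler class itself also has to implement the rendering-related methods of its parent class, which are left out here; the PDF and DjVu handlers linked above are the better guide for that part.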
-Chad
> Do you have the PdfHandler extension installed as well? If that's installed then this should Just Work without any additional configuration. Unless something has changed recently....
Yes, PdfHandler is installed.
> Sure. We've got a class in MediaWiki called ImageHandler. Media types that require special handling have a subclass of that. Here's the ones for PDF and DjVu for example:
> https://phabricator.wikimedia.org/diffusion/EPHD/browse/master/PdfHandler_bo...
> https://phabricator.wikimedia.org/diffusion/MW/browse/master/includes/media/...
> If you wanted to index, say, Word documents, you'd need a similar class in an extension to provide that support (there might be an extension for word docs already, I dunno).
Ok, will have a look at that.
Thanks all!
Stefan
mediawiki-l@lists.wikimedia.org