Re: [Mediawiki-l] Storing or Linking Documents

9 Apr 2007

Dave Sigafoos:
...

 I have stated that I want to be able to index other types of documents.
 That just having an IMAGE type, while probably fine in the beginning
 and probably fine for the majority of use, might be *limiting*.  I
 believe that these 2 things are still there so that is what I am looking
 at.

I don't think you're quite getting the problem here.  As I said before,
identifying the type is not the problem.  The fact that the Image: namespace is called
Image: is totally irrelevant.  There is no "IMAGE" type in MediaWiki, in the
sense in which I think you mean it; there's just a place where uploaded files of *all*
types get stored, and it happens to be called "Image:".

Right now, if you upload a Word or PPT document, we can easily identify the type, either
by running something like Unix's "file" on it, or simply by looking at the
suffix (.doc, .ppt, etc.) of the filename.  But as I stated before, there are two reasons
why we cannot search on these documents, and neither of them has anything to do with
identifying their type.
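
To illustrate how cheap type identification is, here is a minimal Python sketch of both approaches mentioned above: looking at the suffix, and looking at the leading bytes the way "file" does. The tables and function names are illustrative assumptions, not anything MediaWiki ships.

```python
from pathlib import PurePath

# Illustrative suffix table (an assumption for this sketch, not a MediaWiki setting).
SUFFIX_TYPES = {
    ".doc": "Word document",
    ".ppt": "PowerPoint presentation",
    ".pdf": "PDF document",
    ".txt": "plain text",
}

# Leading "magic" bytes, the kind of signature a tool like file(1) checks.
MAGIC_TYPES = [
    (b"%PDF-", "PDF document"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy Word, Excel, PowerPoint)"),
]


def type_from_suffix(filename):
    """Guess the type from the filename suffix alone."""
    return SUFFIX_TYPES.get(PurePath(filename).suffix.lower(), "unknown")


def type_from_content(data):
    """Guess the type from the first bytes of the file."""
    for magic, name in MAGIC_TYPES:
        if data.startswith(magic):
            return name
    return "unknown"
```

Either function answers "what is this file?" in a few lines, which is why identification is not the obstacle to searching.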

1.  Right now, MediaWiki's search is implemented using MySQL's full-text search.
Unlike regular wiki pages, the contents of uploaded documents are not stored in MySQL, and
therefore cannot be searched in this way.

2.  You can only search for a text string in a file if the file is in a format you
understand.  That means writing a custom decoder for every format you want to handle.  For
plain text files, this is trivial, but for other files, it is not.  For example,
here's an excerpt from a PDF file:

    ZÊü'¾h‹žu;ý,½‰Ÿóƒ‡·1oµ÷¯ù/g>QÁP€$IabþÆºžUÐ‘›ÈsF±
    _ÚæcçY~W¬%ó?åÈC‹›œ¯¼ÐO8
    Ccý"š0~”œ%ò<GÃyì
ÉeºœŸ¸
    c+±j5[©J²WW	ŒýDDÑ)Dp

Do you see the words "Inspection of clamps and of flexible piping" in there?
They're in there... somewhere.  Simply knowing that this is a PDF doesn't help us
to find them, though.  Like it or not, formats like Word and Excel are proprietary, and
that does make writing third-party tools for them harder.
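
One common reason the words are invisible is that PDF content streams are often stored compressed (the Flate filter, which is zlib). The post does not say how the excerpt above was encoded, so the following Python sketch only assumes a Flate-style stream; the page content in it is made up for illustration, and real PDFs add further layers such as font encodings.

```python
import zlib

PHRASE = b"Inspection of clamps and of flexible piping"

# Made-up page content in the style of PDF text operators (an assumption for
# this sketch, not the actual file the excerpt came from).
content = b"\n".join([
    b"BT /F1 12 Tf 72 720 Td (Maintenance schedule) Tj ET",
    b"BT /F1 12 Tf 72 700 Td (" + PHRASE + b") Tj ET",
    b"BT /F1 12 Tf 72 680 Td (Replacement of worn seals) Tj ET",
])

# What a Flate-compressed stream would hold on disk.
stored = zlib.compress(content)


def contains_phrase(data):
    """A naive byte-level search, like grepping the raw file."""
    return PHRASE in data


def decode_stream(data):
    """The 'decoder' step: undo the compression before searching."""
    return zlib.decompress(data)
```

Searching `stored` directly finds nothing, while searching `decode_stream(stored)` finds the phrase: the search only works once a format-aware decoder has run.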

So, in an attempt to take the heat out of this and get to facts, what I think you're
looking for is an extension to:

1. allow admins to configure decoders for specific document types
2. run the right decoder (if any) when a document is uploaded
3. add the resulting plain text to the "searchindex" table.

You would then have to find, install and configure decoders for your most-used document
types.
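
The three steps above can be sketched roughly as follows. This is a Python illustration of the idea, not MediaWiki code: the registry, the function names, and the two-column table are hypothetical, and an in-memory SQLite table with a LIKE query stands in for MySQL and the real "searchindex" schema.

```python
import sqlite3
from pathlib import PurePath

# Step 1: hypothetical registry mapping a filename suffix to a decoder,
# i.e. a function that turns raw file bytes into plain text.
DECODERS = {}


def register_decoder(suffix, func):
    DECODERS[suffix.lower()] = func


def decode_plain_text(data):
    # Plain text is the trivial case; other formats need real decoders.
    return data.decode("utf-8", errors="replace")


register_decoder(".txt", decode_plain_text)


def open_index():
    # Stand-in for the search table; column names are assumptions.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE searchindex (title TEXT PRIMARY KEY, body TEXT)")
    return db


def on_upload(db, filename, data):
    """Steps 2 and 3: run the matching decoder, if any, and index its output."""
    decoder = DECODERS.get(PurePath(filename).suffix.lower())
    if decoder is None:
        return False  # no decoder configured, so the file stays unsearchable
    db.execute(
        "INSERT OR REPLACE INTO searchindex (title, body) VALUES (?, ?)",
        (filename, decoder(data)),
    )
    return True


def search(db, term):
    # Simple substring match; % and _ in the term are not escaped in this sketch.
    rows = db.execute(
        "SELECT title FROM searchindex WHERE body LIKE ? ORDER BY title",
        (f"%{term}%",),
    )
    return [title for (title,) in rows]
```

Supporting a new document type then means registering one more decoder; uploads with no registered decoder are simply left out of the index.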

I haven't had time to look into this in detail, but maybe more knowledgeable folk here
could comment on whether this strategy makes sense; or, indeed, whether something like
this already exists or is in progress.

Ian
