[Mediawiki-l] Storing or Linking Documents

Mon Apr 9 20:59:36 UTC 2007

The approach suggested by Ian Smith below is the one adopted by a couple of
systems I work with.  For example, Microsoft SharePoint 2007 uses "iFilters"
for each document type to extract indexable information.  Mac OS X's
"Spotlight" feature also has per-filetype "importers" for extracting
indexable text.
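
To make the per-filetype idea concrete, here is a minimal Python sketch of the same pattern: a registry maps a filename suffix to a text extractor, and the matching extractor runs when a file is uploaded. Everything in it (EXTRACTORS, register, extract_text, index_upload, and the toy in-memory index) is hypothetical and is not MediaWiki, SharePoint, or Spotlight code. A real setup would presumably register external converters for formats such as PDF or Word.

```python
import os
import re

# Hypothetical registry: filename suffix -> function turning raw bytes into plain text.
EXTRACTORS = {}


def register(suffix):
    """Decorator that registers an extractor for one filename suffix."""
    def wrap(func):
        EXTRACTORS[suffix.lower()] = func
        return func
    return wrap


@register(".txt")
def extract_plain(data):
    # Plain text is the trivial case: just decode the bytes.
    return data.decode("utf-8", errors="replace")


@register(".html")
def extract_html(data):
    # Crude tag stripping, good enough for a sketch only.
    return re.sub(r"<[^>]+>", " ", data.decode("utf-8", errors="replace"))


def extract_text(filename, data):
    """Return indexable text, or None if no extractor is configured for the suffix."""
    suffix = os.path.splitext(filename)[1].lower()
    extractor = EXTRACTORS.get(suffix)
    return extractor(data) if extractor else None


def index_upload(index, filename, data):
    """Add an uploaded file's words to a toy index (word -> set of filenames)."""
    text = extract_text(filename, data)
    if text is None:
        return False  # unknown format: stored, but not searchable
    for word in re.findall(r"\w+", text.lower()):
        index.setdefault(word, set()).add(filename)
    return True
```

Files with no registered extractor are still accepted but simply stay out of the index, which is roughly how the systems above behave when a filter or importer is missing.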

So the concept is not unworkable, but it seems to me to be a stretch for
MediaWiki.  MW is a Wiki-Page Management System™ <grin>, at which it excels;
it's not a very good Document Management System, which is where Dave
Sigafoos is apparently being driven (perhaps in slow stages) by his users.

Perhaps Dave should investigate other document management approaches and a
metasearch engine to search across multiple systems.
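
By "metasearch" I mean something along these lines: one query is sent to each system's own search and the hits are merged into one list. This is only a rough Python sketch under my own assumptions. The metasearch function and the backend callables are hypothetical, and scores from different systems are not really comparable without some normalization.

```python
def metasearch(query, backends):
    """Send one query to several search backends and merge the hits.

    `backends` maps a system name to a callable that takes the query and
    returns a list of (score, title) pairs. Both are hypothetical stand-ins
    for whatever search interface each real system exposes.
    """
    merged = []
    for name, search in backends.items():
        for score, title in search(query):
            merged.append((score, name, title))
    # Highest score first; assumes scores are on a comparable scale.
    merged.sort(key=lambda hit: hit[0], reverse=True)
    return merged
```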

My $.02...

-- Joshua

On 4/9/07 11:29 AM, "Ian Smith" <ismith at good.com> wrote:

> 
> Dave Sigafoos:
>> 
>> I have stated that I want to be able to index other types of documents.
>> That just having an IMAGE type, while probably fine in the beginning
>> and probably fine for the majority of use, might be *limiting*.  I
>> believe that these 2 things are still there so that is what I am looking
>> at.
> 
> I don't think you're quite getting the problem here.  As I said before,
> identifying the type is not the problem.  The fact that the Image: namespace
> is called Image: is totally irrelevant.  There is no "IMAGE" type in
> MediaWiki, in the sense in which I think you mean it; there's just a place
> where uploaded files of *all* types get stored, and it happens to be called
> "Image:".
> 
> Right now, if you upload a Word or PPT document, we can easily identify the
> type, either by running something like Unix's "file" on it, or simply by
> looking at the suffix (.doc, .ppt, etc.) of the filename.  But as I stated
> before, there are two reasons why we cannot search on these documents, and
> neither of them has anything to do with identifying their type.
> 
> 1.  Right now, MediaWiki's searches are implemented by using the MySQL search
> feature.  Unlike regular Wiki pages, uploaded documents do not go into MySQL,
> and therefore cannot be searched in this way.
> 
> 2.  You can only search for a text string in a file if the file is in a format
> you understand.  That means writing a custom decoder for every format you want
> to handle.  For plain text files, this is trivial, but for other files, it is
> not.  For example, here's an excerpt from a PDF file:
> 
>     Z
Êü'¾h‹žu;ý,½‰Ÿóƒ‡·1oµ÷¯ù/g>QÁP€$IabþÆºžUÐ‘›ÈsF±
>     _ÚæcçY~W¬%ó?åÈC‹›œ¯¼ÐO8
>     Ccý"š0~”œ%ò<GÃyì
> ÉeºœŸ¸
>     c+±j5[©J²WW ŒýDDÑ)Dp
> 
> Do you see the words "Inspection of clamps and of flexible piping" in there?
> They're in there.... somewhere.  Simply knowing that this is a PDF doesn't
> help us to find them, though.  Like it or not, formats like Word and Excel are
> proprietary, and that does make writing third-party tools for them harder.
> 
> So, in an attempt to take the heat out of this and get to facts, what I think
> you're looking for is an extension to:
> 1. allow admins to configure decoders for specific document types
> 2. run the right decoder (if any) when a document is uploaded
> 3. add the resulting plain text to the "searchindex" table.
> You would then have to find, install and configure decoders for your most-used
> document types.
> 
> I haven't had time to look into this in detail, but maybe more knowledgeable
> folk here could comment on whether this strategy makes sense; or, indeed,
> whether something like this already exists or is in progress.
> 
> Ian
> 
> _______________________________________________
> MediaWiki-l mailing list
> MediaWiki-l at lists.wikimedia.org
> http://lists.wikimedia.org/mailman/listinfo/mediawiki-l