Re: [Mediawiki-l] Storing or Linking Documents

9 Apr 2007

The approach suggested by Ian Smith below is the one adopted by a couple of
systems I work with.  For example, Microsoft SharePoint 2007 uses "iFilters"
for each document type to extract indexable information.  Mac OS X's
"Spotlight" feature also has per-filetype "importers" for extracting
indexable text.

So the concept is not unworkable, but it seems to me to be a stretch for
MediaWiki.  MW is a Wiki-Page Management System™ <grin>, at which it excels;
it's not a very good Document Management System, which is where Dave
Sigafoos is apparently being driven (perhaps in slow stages) by his users.

Perhaps Dave should investigate other document management approaches and a
metasearch engine to search across multiple systems.

My $.02...

-- Joshua

On 4/9/07 11:29 AM, "Ian Smith" <ismith(a)good.com> wrote:

...

 Dave Sigafoos:

 I have stated that I want to be able to index other types of documents.
 That just having an IMAGE type, while probably fine in the beginning
 and probably fine for the majority of use, might be *limiting*.  I
 believe that these 2 things are still there so that is what I am looking
 at.

 I don't think you're quite getting the problem here.  As I said before,
 identifying the type is not the problem.  The fact that the Image: namespace
 is called Image: is totally irrelevant.  There is no "IMAGE" type in
 MediaWiki, in the sense in which I think you mean it; there's just a place
 where uploaded files of *all* types get stored, and it happens to be called
 "Image:".

 Right now, if you upload a Word or PPT document, we can easily identify the
 type, either by running something like Unix's "file" on it, or simply by
 looking at the suffix (.doc, .ppt, etc.) of the filename.  But as I stated
 before, there are two reasons why we cannot search on these documents, and
 neither of them has anything to do with identifying their type.
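
To illustrate how cheap the type-identification step is, here is a minimal Python sketch of both approaches mentioned above: looking at the filename suffix, and checking the leading bytes of the file the way a tool like "file" does. The function names and the small lookup tables are hypothetical, invented for this example; they are not part of MediaWiki.

```python
from pathlib import Path

# Hypothetical suffix table; a real one would be much longer.
SUFFIX_TYPES = {
    ".doc": "Word",
    ".ppt": "PowerPoint",
    ".xls": "Excel",
    ".pdf": "PDF",
    ".txt": "plain text",
}

# Leading-byte signatures: PDF files begin with "%PDF-", and the older
# binary Office formats share the OLE2 compound-file header.
MAGIC_SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 container (Word/Excel/PowerPoint)"),
]


def type_from_suffix(filename: str) -> str:
    """Guess the document type from the filename suffix alone."""
    return SUFFIX_TYPES.get(Path(filename).suffix.lower(), "unknown")


def type_from_magic(head: bytes) -> str:
    """Guess the document type from the first bytes of the file."""
    for signature, name in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return name
    return "unknown"
```

Either function answers "what is this file?" in a few lines, which is the point: knowing the type says nothing yet about the text inside.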

 1.  Right now, MediaWiki's searches are implemented by using the MySQL search
 feature.  Unlike regular Wiki pages, uploaded documents do not go into MySQL,
 and therefore cannot be searched in this way.

 2.  You can only search for a text string in a file if the file is in a format
 you understand.  That means writing a custom decoder for every format you want
 to handle.  For plain text files, this is trivial, but for other files, it is
 not.  For example, here's an excerpt from a PDF file:

     ZÊü'¾h‹žu;ý,½‰Ÿóƒ‡·1oµ÷¯ù/g>QÁP€$IabþÆºžUÐ‘›ÈsF±
     _ÚæcçY~W¬%ó?åÈC‹›œ¯¼ÐO8
     Ccý"š0~”œ%ò<GÃyì
 ÉeºœŸ¸
     c+±j5[©J²WW ŒýDDÑ)Dp

 Do you see the words "Inspection of clamps and of flexible piping" in there?
 They're in there.... somewhere.  Simply knowing that this is a PDF doesn't
 help us to find them, though.  Like it or not, formats like Word and Excel are
 proprietary, and that does make writing third-party tools for them harder.
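
One reason the words are invisible is that PDF page content is commonly stored in compressed streams. The Python sketch below does not parse a real PDF; it only assumes zlib-style compression (a typical choice for PDF streams) and shows that a naive byte search fails on the compressed data and succeeds once a decoder has run.

```python
import zlib

phrase = b"Inspection of clamps and of flexible piping"

# Stand-in for a page's text; a real PDF stream also contains
# drawing operators and font references around the text.
page_text = (phrase + b". ") * 4

# Roughly what ends up stored on disk for a compressed stream.
stored_stream = zlib.compress(page_text)

# A plain substring search over the stored bytes finds nothing...
found_in_stored = phrase in stored_stream

# ...but the same search works after decoding.
found_after_decoding = phrase in zlib.decompress(stored_stream)
```

Here `found_in_stored` is False and `found_after_decoding` is True, which is why a format-aware decoder has to run before any text search can work.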

 So, in an attempt to take the heat out of this and get to facts, what I think
 you're looking for is an extension to:
 1. allow admins to configure decoders for specific document types
 2. run the right decoder (if any) when a document is uploaded
 3. add the resulting plain text to the "searchindex" table.
 You would then have to find, install and configure decoders for your most-used
 document types.
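
The three steps above can be sketched roughly as follows. This is an illustration in Python, not MediaWiki extension code: the names `register_decoder`, `index_upload` and `search` are hypothetical, and a plain dictionary stands in for the "searchindex" table.

```python
from pathlib import Path
from typing import Callable, Dict, List

Decoder = Callable[[bytes], str]

# Step 1: admin-configured decoders, keyed by filename suffix.
DECODERS: Dict[str, Decoder] = {}


def register_decoder(suffix: str, decoder: Decoder) -> None:
    """Associate a decoder with a suffix such as '.txt'."""
    DECODERS[suffix.lower()] = decoder


def decode_plain_text(data: bytes) -> str:
    """The trivial case: plain text needs no real decoding."""
    return data.decode("utf-8", errors="replace")


def index_upload(filename: str, data: bytes, search_index: Dict[str, str]) -> bool:
    """Steps 2 and 3: run the matching decoder, if any, and store its text.

    Returns False when no decoder is configured for this file type,
    in which case the upload simply stays unsearchable.
    """
    decoder = DECODERS.get(Path(filename).suffix.lower())
    if decoder is None:
        return False
    search_index[filename] = decoder(data)
    return True


def search(search_index: Dict[str, str], term: str) -> List[str]:
    """Case-insensitive substring search over the indexed text."""
    needle = term.lower()
    return sorted(name for name, text in search_index.items() if needle in text.lower())


register_decoder(".txt", decode_plain_text)
```

With only the plain-text decoder registered, uploading `notes.txt` makes it searchable while `manual.pdf` is skipped until someone installs and registers a PDF decoder, which matches the "find, install and configure" caveat above.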

 I haven't had time to look into this in detail, but maybe more knowledgeable
 folk here could comment on whether this strategy makes sense; or, indeed,
 whether something like this already exists or is in progress.

 Ian

