The approach suggested by Ian Smith below is the one adopted by a couple of
systems I work with. For example, Microsoft SharePoint 2007 uses "IFilters"
for each document type to extract indexable information. Mac OS X's
"Spotlight" feature also has per-filetype "importers" for extracting
indexable text.
So the concept is not unworkable, but it seems to me to be a stretch for
MediaWiki. MW is a Wiki-Page Management System™ <grin>, at which it excels;
it's not a very good Document Management System, which is where Dave
Sigafoos is apparently being driven (perhaps in slow stages) by his users.
Perhaps Dave should investigate other document management approaches and a
metasearch engine to search across multiple systems.
My $.02...
-- Joshua
On 4/9/07 11:29 AM, "Ian Smith" <ismith(a)good.com> wrote:
Dave Sigafoos:
I have stated that I want to be able to index other types of documents.
That just having an IMAGE type, while probably fine in the beginning
and probably fine for the majority of use, might be *limiting*. I
believe that these 2 things are still there so that is what I am looking
at.
I don't think you're quite getting the problem here. As I said before,
identifying the type is not the problem. The fact that the Image: namespace
is called Image: is totally irrelevant. There is no "IMAGE" type in
MediaWiki, in the sense in which I think you mean it; there's just a place
where uploaded files of *all* types get stored, and it happens to be called
"Image:".
Right now, if you upload a Word or PPT document, we can easily identify the
type, either by running something like Unix's "file" on it, or simply by
looking at the suffix (.doc, .ppt, etc.) of the filename. But as I stated
before, there are two reasons why we cannot search on these documents, and
neither of them has anything to do with identifying their type.
1. Right now, MediaWiki's searches are implemented using MySQL's full-text
search. Unlike regular wiki pages, the contents of uploaded documents do not
go into MySQL, and therefore cannot be searched in this way.
2. You can only search for a text string in a file if the file is in a format
you understand. That means writing a custom decoder for every format you want
to handle. For plain text files, this is trivial, but for other files, it is
not. For example, here's an excerpt from a PDF file:
ZÊü'¾h‹žu;ý,½‰Ÿóƒ‡·1oµ÷¯ù/g>QÁP€$IabþƺžUБ›ÈsF±
_ÚæcçY~W¬%ó?åÈC‹›œ¯¼ÐO8
Ccý"š0~”œ%ò<GÃyì
ÉeºœŸ¸
c+±j5[©J²WW ŒýDDÑ)Dp
Do you see the words "Inspection of clamps and of flexible piping" in there?
They're in there.... somewhere. Simply knowing that this is a PDF doesn't
help us to find them, though. Like it or not, formats like Word and Excel are
proprietary, and that does make writing third-party tools for them harder.
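
To make the point about the PDF excerpt concrete, here is a minimal Python sketch. It assumes the page text sits in a stream compressed with zlib/deflate (what PDF calls FlateDecode), which is common but not universal. The helper names are made up for illustration, and a real PDF decoder would also have to parse the file structure and handle font encodings, which this does not attempt.

```python
import zlib

PHRASE = b"Inspection of clamps and of flexible piping"

def make_stream(text):
    """Compress text with zlib, standing in for a FlateDecode stream (assumption)."""
    return zlib.compress(text)

def found_in_raw(stream, needle):
    """Naive search over the stored bytes, as a plain string search would do."""
    return needle in stream

def found_after_decoding(stream, needle):
    """Search after undoing the compression, i.e. with a format-aware decoder."""
    return needle in zlib.decompress(stream)

# Hypothetical page content: the phrase repeated a few times.
page_text = (PHRASE + b". ") * 10
stream = make_stream(page_text)

print(found_in_raw(stream, PHRASE))
print(found_after_decoding(stream, PHRASE))
```

The raw search fails even though the words are "in there", because the stored bytes are Huffman-coded rather than literal text; only the format-aware path finds them.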
So, in an attempt to take the heat out of this and get to facts, what I think
you're looking for is an extension to:
1. allow admins to configure decoders for specific document types
2. run the right decoder (if any) when a document is uploaded
3. add the resulting plain text to the "searchindex" table.
You would then have to find, install and configure decoders for your most-used
document types.
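
The three steps above could be sketched roughly as follows. This is Python rather than MediaWiki's PHP, and everything in it is hypothetical: the `DECODERS` registry, the `on_upload` hook, and the `upload_index` table (an in-memory SQLite stand-in for "searchindex", searched with a plain LIKE rather than a full-text index). It only shows the shape of the idea, not how an actual extension would hook into MediaWiki.

```python
import sqlite3
from pathlib import Path

# Step 1: hypothetical admin-configured registry, file suffix -> decoder.
# A decoder takes the raw uploaded bytes and returns plain text.
DECODERS = {}

def register_decoder(suffix, decoder):
    DECODERS[suffix.lower()] = decoder

def decode_plain_text(data):
    """Trivial decoder for plain text files."""
    return data.decode("utf-8", errors="replace")

def open_index():
    """Create the stand-in index table (not MediaWiki's real schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE upload_index (filename TEXT PRIMARY KEY, body TEXT)"
    )
    return conn

def on_upload(conn, filename, data):
    """Steps 2 and 3: run the matching decoder, if any, and index its output.

    Returns True if the file was indexed, False if no decoder is configured.
    """
    decoder = DECODERS.get(Path(filename).suffix.lower())
    if decoder is None:
        return False
    conn.execute(
        "INSERT OR REPLACE INTO upload_index (filename, body) VALUES (?, ?)",
        (filename, decoder(data)),
    )
    return True

def search(conn, term):
    """Return the names of indexed uploads whose text contains term."""
    rows = conn.execute(
        "SELECT filename FROM upload_index WHERE body LIKE ? ORDER BY filename",
        ("%" + term + "%",),
    )
    return [row[0] for row in rows]

register_decoder(".txt", decode_plain_text)
```

With only the `.txt` decoder registered, a text upload becomes searchable while a PDF is stored but skipped by the indexer until someone configures a decoder for `.pdf`.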
I haven't had time to look into this in detail, but maybe more knowledgeable
folk here could comment on whether this strategy makes sense; or, indeed,
whether something like this already exists or is in progress.
Ian