Dave Sigafoos:
I have stated that I want to be able to index other types of documents.
That just having an IMAGE type, while probably fine in the beginning
and probably fine for the majority of use, might be *limiting*. I
believe that these 2 things are still there so that is what I am looking
at.
I don't think you're quite getting the problem here. As I said before,
identifying the type is not the problem. The fact that the Image: namespace is called
Image: is totally irrelevant. There is no "IMAGE" type in MediaWiki, in the
sense in which I think you mean it; there's just a place where uploaded files of *all*
types get stored, and it happens to be called "Image:".
Right now, if you upload a Word or PPT document, we can easily identify the type, either
by running something like Unix's "file" on it, or simply by looking at the
suffix (.doc, .ppt, etc.) of the filename. But as I stated before, there are two reasons
why we cannot search on these documents, and neither of them has anything to do with
identifying their type.
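To show how little work the identification step is, here is a minimal Python sketch. The
function name and both lookup tables are made up for illustration and cover only a few
formats; a real tool like "file" knows far more signatures.

```python
import os

# Illustrative subset of leading-byte signatures ("magic numbers").
MAGIC = {
    b"%PDF-": "pdf",
    # OLE2 container used by legacy Word, PowerPoint and Excel files
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "ole2",
}

# Illustrative subset of filename suffixes.
SUFFIXES = {".doc": "word", ".ppt": "powerpoint", ".pdf": "pdf", ".txt": "text"}


def guess_type(filename, head=b""):
    """Guess a document type from its first bytes, falling back to the suffix."""
    for magic, kind in MAGIC.items():
        if head.startswith(magic):
            return kind
    ext = os.path.splitext(filename)[1].lower()
    return SUFFIXES.get(ext, "unknown")
```

Either check is a few lines of code, which is why the type is not the obstacle here.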
1. Right now, MediaWiki's searches are implemented using MySQL's full-text search.
Unlike regular wiki pages, the contents of uploaded documents do not go into MySQL, and
therefore cannot be searched in this way.
2. You can only search for a text string in a file if the file is in a format you
understand. That means writing a custom decoder for every format you want to handle. For
plain text files, this is trivial, but for other files, it is not. For example,
here's an excerpt from a PDF file:
ZÊü'¾h‹žu;ý,½‰Ÿóƒ‡·1oµ÷¯ù/g>QÁP€$IabþƺžUБ›ÈsF±
_ÚæcçY~W¬%ó?åÈC‹›œ¯¼ÐO8
Ccý"š0~”œ%ò<GÃyì
ÉeºœŸ¸
c+±j5[©J²WW ŒýDDÑ)Dp
Do you see the words "Inspection of clamps and of flexible piping" in there?
They're in there.... somewhere. Simply knowing that this is a PDF doesn't help us
to find them, though. Like it or not, formats like Word and Excel are proprietary, and
that does make writing third-party tools for them harder.
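One common reason the words are invisible is that PDF content streams are usually
compressed. The Python sketch below is not a PDF parser; it only compresses some sample
text with zlib (the same deflate scheme PDF's Flate filter uses) to show that a plain
substring search fails on the stored bytes and succeeds once they are decoded. The sample
text is invented for the demonstration.

```python
import zlib

phrase = b"Inspection of clamps and of flexible piping"

# Invented sample text standing in for one page of a document.
page_text = (phrase + b". ") * 3

# What would actually be stored in the file: compressed bytes.
stream = zlib.compress(page_text)

# Searching the stored bytes directly does not find the phrase...
found_raw = phrase in stream

# ...but searching the decoded bytes does.
found_decoded = phrase in zlib.decompress(stream)

print(found_raw, found_decoded)
```

A real PDF decoder also has to locate the streams, handle other filters and map font
encodings back to characters, so this is only the easiest part of the job.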
So, in an attempt to take the heat out of this and get to facts, what I think you're
looking for is an extension to:
1. allow admins to configure decoders for specific document types
2. run the right decoder (if any) when a document is uploaded
3. add the resulting plain text to the "searchindex" table.
You would then have to find, install and configure decoders for your most-used document
types.
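As a rough sketch of how those three steps might fit together, here is some Python (not
MediaWiki code, which is PHP). Everything in it is an assumption made for illustration:
the function names, the decoder registry, and the two-column table, which is an SQLite
stand-in and not the real "searchindex" schema.

```python
import os
import sqlite3

# Step 1: hypothetical admin configuration, mapping a suffix to a decoder.
DECODERS = {}


def register_decoder(suffix, func):
    """Associate a filename suffix such as '.txt' with a decoder function."""
    DECODERS[suffix.lower()] = func


def decode_plain_text(data):
    """Trivial decoder: plain text only needs its bytes turned into a string."""
    return data.decode("utf-8", errors="replace")


def index_upload(db, title, filename, data):
    """Steps 2 and 3: run the matching decoder, if any, and store its output."""
    suffix = os.path.splitext(filename)[1].lower()
    decoder = DECODERS.get(suffix)
    if decoder is None:
        return False  # no decoder configured: file is kept but not searchable
    db.execute(
        "INSERT INTO searchindex (title, text) VALUES (?, ?)",
        (title, decoder(data)),
    )
    return True


# Stand-in for the wiki database (simplified, assumed schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE searchindex (title TEXT, text TEXT)")

register_decoder(".txt", decode_plain_text)
index_upload(db, "Image:Notes.txt", "Notes.txt", b"Inspection of clamps")
index_upload(db, "Image:Manual.pdf", "Manual.pdf", b"%PDF-1.4 ...")

rows = db.execute(
    "SELECT title FROM searchindex WHERE text LIKE ?", ("%clamps%",)
).fetchall()
```

With only the plain-text decoder registered, the .txt upload becomes searchable and the
PDF is skipped, which is the situation until someone installs a PDF decoder.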
I haven't had time to look into this in detail, but maybe more knowledgeable folk here
could comment on whether this strategy makes sense; or, indeed, whether something like
this already exists or is in progress.
Ian