Dave Sigafoos:
I have stated that I want to be able to index other types of documents. That just having an IMAGE type, while probably fine in the beginning and probably fine for the majority of use, might be *limiting*. I believe that these 2 things are still there so that is what I am looking at.
I don't think you're quite getting the problem here. As I said before, identifying the type is not the problem. The fact that the Image: namespace is called Image: is totally irrelevant. There is no "IMAGE" type in MediaWiki, in the sense in which I think you mean it; there's just a place where uploaded files of *all* types get stored, and it happens to be called "Image:".
Right now, if you upload a Word or PPT document, we can easily identify the type, either by running something like Unix's "file" on it, or simply by looking at the suffix (.doc, .ppt, etc.) of the filename. But as I stated before, there are two reasons why we cannot search on these documents, and neither of them has anything to do with identifying their type.
1. Right now, MediaWiki's searches are implemented by using the MySQL search feature. Unlike regular Wiki pages, uploaded documents do not go into MySQL, and therefore cannot be searched in this way.
2. You can only search for a text string in a file if the file is in a format you understand. That means writing a custom decoder for every format you want to handle. For plain text files, this is trivial, but for other files, it is not. For example, here's an excerpt from a PDF file:
ZÊü'¾h‹žu;ý,½‰Ÿóƒ‡·1oµ÷¯ù/g>QÁP€$IabþƺžUБ›ÈsF± _ÚæcçY~W¬%ó?åÈC‹›œ¯¼ÐO8 Ccý"š0~”œ%ò<GÃyì ÉeºœŸ¸ c+±j5[©J²WW ŒýDDÑ)Dp
Do you see the words "Inspection of clamps and of flexible piping" in there? They're in there.... somewhere. Simply knowing that this is a PDF doesn't help us to find them, though. Like it or not, formats like Word and Excel are proprietary, and that does make writing third-party tools for them harder.
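To make that concrete, here is a minimal Python sketch of the same effect. It assumes the text is stored Deflate-compressed, which is common for PDF content streams (the FlateDecode filter), though I have not checked that the excerpt above uses it. The function names are made up for illustration.

```python
import zlib

PHRASE = b"Inspection of clamps and of flexible piping"

def store_compressed(text: bytes) -> bytes:
    """Compress text with Deflate, roughly as a PDF writer might (assumption)."""
    return zlib.compress(text)

def naive_search(stored: bytes, term: bytes) -> bool:
    """Search the stored bytes directly, without decoding them first."""
    return term in stored

def decoded_search(stored: bytes, term: bytes) -> bool:
    """Decode the stored bytes, then search the recovered text."""
    return term in zlib.decompress(stored)

# A page of text that repeats the phrase, so the compressor has real work to do.
page = (PHRASE + b"\n") * 50
stored = store_compressed(page)

print(naive_search(stored, PHRASE))    # searching the raw file bytes
print(decoded_search(stored, PHRASE))  # searching after running a decoder
```

The search only succeeds once a decoder for the format has been run, and a real PDF needs far more than a single decompress call, because the text is also wrapped in the format's own drawing operators and font encodings.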
So, in an attempt to take the heat out of this and get to facts, what I think you're looking for is an extension to:
1. allow admins to configure decoders for specific document types
2. run the right decoder (if any) when a document is uploaded
3. add the resulting plain text to the "searchindex" table.
You would then have to find, install and configure decoders for your most-used document types.
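The three steps above can be sketched roughly as follows. This is Python rather than MediaWiki's PHP, and none of it is an existing MediaWiki API: the registry, the upload hook and the in-memory dictionary standing in for the "searchindex" table are all hypothetical names, chosen only to show the shape of the idea.

```python
from pathlib import Path
from typing import Callable, Dict, List

# A decoder turns the raw bytes of an uploaded file into plain text.
Decoder = Callable[[bytes], str]

# Step 1 (hypothetical): admin-configured map from filename suffix to decoder.
DECODERS: Dict[str, Decoder] = {}

# Stand-in for the "searchindex" table: filename -> extracted plain text.
search_index: Dict[str, str] = {}

def register_decoder(suffix: str, decoder: Decoder) -> None:
    """Configure a decoder for one document type, keyed by suffix."""
    DECODERS[suffix.lower()] = decoder

def decode_plain_text(data: bytes) -> str:
    """The trivial case: plain text files only need a character decode."""
    return data.decode("utf-8", errors="replace")

def on_upload(filename: str, data: bytes) -> bool:
    """Steps 2 and 3: run the right decoder, if any, and index the result."""
    decoder = DECODERS.get(Path(filename).suffix.lower())
    if decoder is None:
        return False  # no decoder configured, so the file stays unsearchable
    search_index[filename] = decoder(data)
    return True

def search(term: str) -> List[str]:
    """Return the names of indexed files whose text contains the term."""
    needle = term.lower()
    return sorted(name for name, text in search_index.items()
                  if needle in text.lower())

register_decoder(".txt", decode_plain_text)
on_upload("clamps.txt", b"Inspection of clamps and of flexible piping")
on_upload("report.pdf", b"%PDF-1.4 ...")  # ignored: no .pdf decoder configured
print(search("flexible piping"))
```

Choosing the decoder by suffix is the easy part, as noted earlier; the work is in finding or writing a decoder for each format you care about and registering it.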
I haven't had time to look into this in detail, but maybe more knowledgeable folk here could comment on whether this strategy makes sense; or, indeed, whether something like this already exists or is in progress.
Ian