Hi,
On Monday 09 April 2007 20:59:36 Joshua Yeidel wrote:
> The approach suggested by Ian Smith below is the one adopted by a couple of systems I work with. For example, Microsoft SharePoint 2007 uses "iFilters" for each document type to extract indexable information. Mac OS X's "Spotlight" feature also has per-filetype "importers" for extracting indexable text.
>
> So the concept is not unworkable, but it seems to me to be a stretch for MediaWiki. MW is a Wiki-Page Management System™ <grin>, at which it excels; it's not a very good Document Management System, which is where Dave Sigafoos is apparently being driven (perhaps in slow stages) by his users.
>
> Perhaps Dave should investigate other document management approaches and a metasearch engine to search across multiple systems.
Well, or write an extension that implements his idea, e.g.:
* upon upload, run the document through an index generator (per file type)
* add that index text to some searchable index, or store it in the MySQL database (a sketch of both options follows after the commands below)
For each file type, you can do something like:

  pdftotext $pdf_file
  exif $image_file
  etc.
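To make that concrete, here is a minimal sketch of such a dispatcher; the /var/wiki-index directory, the file_index table, and the wiki/wikidb credentials are invented for illustration, not part of the original suggestion:

  #!/bin/sh
  # Sketch: extract text from one uploaded file, then index or store it.
  FILE="$1"
  case "$FILE" in
      *.pdf)        TEXT=$(pdftotext "$FILE" -) ;;  # PDF text to stdout
      *.jpg|*.jpeg) TEXT=$(exif "$FILE") ;;         # EXIF metadata as text
      *)            TEXT="" ;;                      # no filter for this type yet
  esac

  # Option 1: write the text to a directory that a search tool can index
  echo "$TEXT" > "/var/wiki-index/$(basename "$FILE").txt"

  # Option 2: store it in MySQL (real code needs proper SQL escaping;
  # doubling single quotes covers only the simplest cases)
  ESCAPED=$(printf '%s' "$TEXT" | sed "s/'/''/g")
  mysql -u wiki wikidb -e \
      "INSERT INTO file_index (filename, body) VALUES ('$FILE', '$ESCAPED')"

How you trigger this (an upload hook in the extension, or just a cron job over wiki/images) is left open, as in the original idea.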
There are very probably filters for doc, ppt, xls, etc. If not, one can always whip one up with a Perl module (I know modules exist for Word and Excel). While these might not capture all the formatting, they will extract the bulk (if not all) of the text, and you can then easily index and search it.
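A few concrete candidates would be the common Unix converters below; I'm assuming their availability here, the original post doesn't name any:

  antiword report.doc     # MS Word -> plain text
  catdoc report.doc       # alternative Word -> text converter
  xls2csv data.xls        # Excel -> CSV (ships with the catdoc package)
  catppt slides.ppt       # PowerPoint -> text (also from catdoc)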
Another option would be to just let the webserver handle this, by running htdig (or a Google Search Appliance?) over the uploaded files (which end up in wiki/images anyway) and presenting the user with a search box to search all these files. This second option wouldn't integrate with MediaWiki that nicely, though.
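For the htdig route, a minimal setup might look like the following; the paths and URL are assumptions, and note that ht://Dig needs an HTTP-reachable start point plus external parsers for non-HTML files:

  # wiki.conf -- minimal ht://Dig configuration (paths/URL invented)
  database_dir:  /var/lib/htdig/wiki
  start_url:     http://wiki.example.com/images/

  # build and merge the index, then point the htsearch CGI at wiki.conf
  htdig -i -c wiki.conf
  htmerge -c wiki.conf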
All the best,
Tels
--
Signed on Mon Apr 9 23:34:55 2007 with key 0x93B84C15.
Get one of my photo posters: http://bloodgate.com/posters
PGP key on http://bloodgate.com/tels.asc or by email.
"One man in a thousand is a leader of men, the other 999 follow women"
-- Groucho Marx