Identifying the type isn't the problem - that's easy. The problem is writing decoders for every document format in the world, and hacking them into the existing MySQL-based search system.
Having said that, if we had to-plaintext converters for key doc formats, couldn't we add that plaintext to the text indexed for the Image: page? This could happen at save, and Wiki admins could configure converters by doc suffix.
Of course, the true answer is still to browbeat our users into using wiki markup... ;-)
Ian
-----Original Message-----
From: Dave Sigafoos [mailto:davesigafoos@sanmar.com]
Sent: Saturday, April 07, 2007 06:45 PM Pacific Standard Time
To: MediaWiki announcements and site admin list
Subject: Re: [Mediawiki-l] Storing or Linking Documents
So how hard would it be to expand the upload process to allow selecting the 'type' of upload? Then the 'type' would be able to be searched thus adding a good benefit to MW.
Also, wouldn't it make sense, since the upload process has a 'comment' that you can enter, to allow a search against this comment? I do understand that searching on the binary of an image really makes no sense (unless you are storing hidden text :) but allowing entry / search of keywords might be a good idea.
Thanks.
DSig David Tod Sigafoos | SANMAR Corporation PICK Guy 206-770-5585 davesigafoos@sanmar.com
-----Original Message-----
From: mediawiki-l-bounces@lists.wikimedia.org [mailto:mediawiki-l-bounces@lists.wikimedia.org] On Behalf Of Jim Wilson
Sent: Friday, April 06, 2007 11:31
To: MediaWiki announcements and site admin list
Subject: Re: [Mediawiki-l] Storing or Linking Documents
The Image: namespace stores the meta-data for all uploaded files; I guess the "Image" name is based on history and how it's used in WP. But for those of us using MW for corporate nets, "Image:" means any uploaded file.
AFAIK, the namespace is called "Image" because that's what it's meant to store - images. Not video, not Excel spreadsheets, not Word docs.
Using the Image upload facility for something other than pure images represents an intentional circumvention of the spirit of the device (regardless of business needs - which I understand).
For the record, we have a wiki here where I work, and yes, people upload Excel spreadsheets and word docs and PDFs and ZIP files and .... etc.
-- Jim
On 4/6/07, Ian Smith ismith@good.com wrote:
Dave Sigafoos:
I had gathered that images weren't searchable which makes sense to me (except for descriptive information) but I did not realize that a document with 'text' would not be searchable.
Documents are simply stored as-is in the filesystem; so, an uploaded Word doc ends up stored in c:\WebServer\mediawiki\images\f\f7\foo.doc. In contrast, Wiki pages are stored as fields in the MySQL database.
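For what it's worth, the "f\f7" part of that path is not arbitrary. As far as I know, MediaWiki takes the MD5 hash of the stored file name and uses its first hex digit and its first two hex digits as the two directory levels. Here is a minimal sketch of that layout in Python (MediaWiki itself is PHP); the function name, the default directory and the simplified title normalisation are assumptions for illustration, and no claim is made that "Foo.doc" really hashes to f7:

```python
import hashlib
from pathlib import PurePosixPath


def hashed_upload_path(filename: str, upload_dir: str = "images") -> PurePosixPath:
    """Return upload_dir/<h>/<hh>/<name>, where h and hh are the first one
    and first two hex digits of the MD5 of the stored file name."""
    # Assumption: the stored name has spaces turned into underscores and
    # its first letter capitalised, roughly as MediaWiki titles are.
    name = filename.replace(" ", "_")
    name = name[:1].upper() + name[1:]
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return PurePosixPath(upload_dir) / digest[0] / digest[:2] / name


if __name__ == "__main__":
    # Prints a path of the form images/<h>/<hh>/Foo.doc
    print(hashed_upload_path("foo.doc"))
```

The point for this thread is that the file lands on disk under a path derived from its name, and nothing about its content goes into the database.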
Search doesn't work on uploaded documents, because:
1. the search uses the MySQL search facility, and so only works on stuff which is in the DB
2. since an uploaded doc could be in any format, there's no way to search it: e.g. if a document compresses its content using some proprietary scheme, there's no general way to look inside it.
Note that the problems go beyond search: features like "What links here" only work for links from Wiki pages, etc.
I do see now that it seems to put all uploaded 'media' to IMAGE: which I am not sure I understand.
The Image: namespace stores the meta-data for all uploaded files; I guess the "Image" name is based on history and how it's used in WP. But for those of us using MW for corporate nets, "Image:" means any uploaded file.
Believe me, I feel your pain... if you find a way to stop your users using Word for a single sentence of plain text, let me know. ;-)
Ian
MediaWiki-l mailing list MediaWiki-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/mediawiki-l
On 08/04/07, Ian Smith ismith@good.com wrote:
Having said that, if we had to-plaintext converters for key doc formats, couldn't we add that plaintext to the text indexed for the Image: page? This could happen at save, and Wiki admins could configure converters by doc suffix.
Antiword generates text from .doc reasonably quickly - that could be put into the indexing pile.
My thought was that if we have the ability to add 'types' we could then define extensions to work with that 'type'.
There are a couple formats that seem to be 'universal' whether we like it or not. Word, excel, ppt, pdf then emerging standards like those coming from open office.
Also code documents, php, html, c and xml etc could be stored.
And it isn't so much " .. The problem is writing decoders for every document format in the world .." as a couple of the standards. For example, in the environment that I work in I have written several API examples to connect to a different database. Once a couple of good examples are there, then some will be able to duplicate the process.
Also I am not sure how many 'decoders' would be needed. For example a word document should be able to be searched without decoding it to plain text. Yes? Maybe not.
" .. couldn't we add that plaintext to the text indexed for the Image: page ..". This would work, but wouldn't it make more sense to have definitions of the 'document type'.
I realize that this is more than the wiki was intended for, but MW is such an incredible 'product' that I can see people using it more and more for their business use.
Of course not all tools should be used for all situations. It just *seems* to me that document/documentation search/retrieval is a close fit.
Thanks for the follow up
DSig David Tod Sigafoos | SANMAR Corporation PICK Guy 206-770-5585 davesigafoos@sanmar.com
Dave Sigafoos wrote:
There are a couple formats that seem to be 'universal' whether we like it or not. Word, excel, ppt, pdf then emerging standards like those coming from open office.
Word - for one example - is a "standard" that is changed by MS every so often for no apparent purpose other than to screw the competition. The latest Office 2007 version apparently breaks a lot of other products because of changes to the "standard".
Unless the standard is open, published, unrestricted and not under the control of a single commercial entity, it is not a standard - it is proprietary. You may like the idea of chasing a moving, camouflaged target, but I don't.
Supporting an open standard and encouraging companies like MS to allow export of their products' files in the open standard format as an alternative to their own is a better solution. MS already exports to RTF, txt and other formats. Adding an open format shouldn't be a big deal.
Mike
Michael,
Thanks for the reply .. a couple notes
".. Word - for one example - is a "standard" that is changed by MS every so often for no apparent purpose other than to screw the competition .. "
I get it. You are a MS hater and that is cool. I am not that fond of them either which is why we use OpenOffice in our business and on my systems at home. But the simple fact is that there are millions of users. Some of them are clients.
" Unless the standard is open, published, unrestricted and not under the control of a single commercial entity, it is not a standard - it is proprietary. You may like the idea of chasing a moving, camouflaged target, but I don't."
Never said it was a standard; I said it was 'universal', which, of course, is a very different thing. Also, I never suggested '.. chasing a moving, camouflaged target ..'. Unless MS removed every 'text' word from their documents, I don't see how an extension that could index the words would be a moving target. Now, possibly MW has been designed in a way that would make this impossible; I haven't gotten that far. But I have written, in other environments, methods of scanning Word docs.
"Supporting an open standard and encouraging companies like MS to allow export of their products' files in the open standard format as an alternative to their own is a better solution. MS already exports to RTF, txt and other formats. Adding an open format shouldn't be a big deal."
Do you think that MS gives a crap whether MW is or is not capable of indexing Word documents? Unless, of course, you feel able to call Bill and have him change his format.
I believe that the market place will decide, rightly or wrongly (and who decides that?), what tools and "standards" will be used. It is my job as a consultant to give the user the best information and let them decide. Then either support the client in their decision or go find another client.
Of course right now it doesn't really matter as the only TYPE is IMAGE and MW doesn't appear to be able to search on it (which makes sense if the only type I wanted to store was IMAGE).
Thanks for the note. It is much like discussions I have with Linux friends <G>
On 09/04/07, Dave Sigafoos davesigafoos@sanmar.com wrote:
I get it. You are a MS hater and that is cool. I am not that fond of them either which is why we use OpenOffice in our business and on my systems at home. But the simple fact is that there are millions of users. Some of them are clients.
No, it's true - Microsoft are known to keep changing their proprietary formats, and it is fair to state that this interferes with anybody trying to interpret them for any purpose, legitimate or otherwise.
Of course right now it doesn't really matter as the only TYPE is IMAGE and MW doesn't appear to be able to search on it (which makes sense if the only type I wanted to store was IMAGE).
David, from what I gather, you keep stating that we can't "search on (images)" but you haven't said what you'd actually like the software to be able to do. It may be possible to add some sort of extensible mechanism for MediaWiki to be able to index uploads in some fashion; if that's what you want, then endorse it.
All I can see is a bunch of "MediaWiki can't even do this, it sucks" type comments from various quarters, none of which are making the problem clear enough for us to understand.
Rob Church
"No, it's true - Microsoft are .."
Yes it is true and I never suggested otherwise, but to give the impression that nothing should be done to help those millions of users .. I just never understand this idea. And that is ok as we all come from different places.
"David, from what I gather, you keep stating that we can't "search on (images)" but you haven't said what you'd actually like the software to be able to do. It may be possible to add some sort of extensible mechanism for MediaWiki to be able to index uploads in some fashion; if that's what you want, then endorse it."
I have stated that I want to be able to index other types of documents, and that having just an IMAGE type, while probably fine in the beginning and probably fine for the majority of uses, might be *limiting*. I believe that these 2 things are still there so that is what I am looking at.
I don't really have anything to endorse because I am still trying to gain information from this list and from examining code (extensions etc). Yes, I believe there could be an extension that handles what I see as a missing feature, BUT until I am knowledgeable enough in MW, saying that specifically would, I think, be *wrong*.
I am not sure where you get the " .. MediaWiki can't even do this, it sucks .. ". I have said before and will say again for the audience that I think MediaWiki is great .. great .. GREAT.
Tell you what, I will duck my head back down and try to work bits out by myself without asking HOW or WHY as it might give the wrong impression. I will try to come up with a specific definition then post it and see if anyone can help me.
Thanks
DSig
Dave Sigafoos:
I have stated that I want to be able to index other types of documents. That just having an IMAGE type, while probably fine in the beginning and probably fine for the majority of use, might be *limiting*. I believe that these 2 things are still there so that is what I am looking at.
I don't think you're quite getting the problem here. As I said before, identifying the type is not the problem. The fact that the Image: namespace is called Image: is totally irrelevant. There is no "IMAGE" type in MediaWiki, in the sense in which I think you mean it; there's just a place where uploaded files of *all* types get stored, and it happens to be called "Image:".
Right now, if you upload a Word or PPT document, we can easily identify the type, either by running something like Unix's "file" on it, or simply by looking at the suffix (.doc, .ppt, etc.) of the filename. But as I stated before, there are two reasons why we cannot search on these documents, and neither of them has anything to do with identifying their type.
1. Right now, MediaWiki's searches are implemented by using the MySQL search feature. Unlike regular Wiki pages, uploaded documents do not go into MySQL, and therefore cannot be searched in this way.
2. You can only search for a text string in a file if the file is in a format you understand. That means writing a custom decoder for every format you want to handle. For plain text files, this is trivial, but for other files, it is not. For example, here's an excerpt from a PDF file:
ZÊü'¾h‹žu;ý,½‰Ÿóƒ‡·1oµ÷¯ù/g>QÁP€$IabþƺžUБ›ÈsF± _ÚæcçY~W¬%ó?åÈC‹›œ¯¼ÐO8 Ccý"š0~”œ%ò<GÃyì ÉeºœŸ¸ c+±j5[©J²WW ŒýDDÑ)Dp
Do you see the words "Inspection of clamps and of flexible piping" in there? They're in there.... somewhere. Simply knowing that this is a PDF doesn't help us to find them, though. Like it or not, formats like Word and Excel are proprietary, and that does make writing third-party tools for them harder.
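To make the second point concrete, here is a small sketch in Python of why a naive byte search fails on compressed content. It assumes the excerpt above comes from a deflate-compressed stream, which is common inside PDFs but is only a guess for this particular file; the filler text and the helper name are invented for illustration:

```python
import zlib

PHRASE = b"Inspection of clamps and of flexible piping"

# Hypothetical page content: the phrase surrounded by some filler text.
page_text = b"Maintenance schedule. " * 10 + PHRASE + b". " + b"End of section. " * 10

# Roughly what ends up stored on disk if the format deflates its content.
stored_bytes = zlib.compress(page_text)


def naive_search(haystack: bytes, needle: bytes) -> bool:
    """Plain substring search, which is all a format-unaware indexer can do."""
    return needle in haystack


if __name__ == "__main__":
    # Searching the stored bytes directly does not find the phrase ...
    print(naive_search(stored_bytes, PHRASE))
    # ... but it is found once something that understands the format decodes it.
    print(naive_search(zlib.decompress(stored_bytes), PHRASE))
```

Knowing the file type tells you which decoder to reach for; it does not do the decoding for you.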
So, in an attempt to take the heat out of this and get to facts, what I think you're looking for is an extension to:
1. allow admins to configure decoders for specific document types
2. run the right decoder (if any) when a document is uploaded
3. add the resulting plain text to the "searchindex" table.
You would then have to find, install and configure decoders for your most-used document types.
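The three steps above can be sketched as follows. This is an illustration only, in Python rather than MediaWiki's PHP: an in-memory SQLite table with a LIKE query stands in for MySQL full-text search, the column names are borrowed from MediaWiki's searchindex table, and DECODERS, on_upload and search are hypothetical names, not an existing extension API:

```python
import sqlite3
from pathlib import Path
from typing import Callable, Dict, List

Decoder = Callable[[bytes], str]

# Step 1: admins map file suffixes to decoders (hypothetical configuration).
DECODERS: Dict[str, Decoder] = {
    ".txt": lambda data: data.decode("utf-8", errors="replace"),
}


def open_index() -> sqlite3.Connection:
    """Create a stand-in for the searchindex table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE searchindex "
        "(si_page INTEGER PRIMARY KEY, si_title TEXT, si_text TEXT)"
    )
    return db


def on_upload(db: sqlite3.Connection, page_id: int, filename: str, data: bytes) -> bool:
    """Steps 2 and 3: run the matching decoder, if any, and index its output."""
    decoder = DECODERS.get(Path(filename).suffix.lower())
    if decoder is None:
        return False  # no decoder configured, so the file stays unsearchable
    db.execute(
        "INSERT OR REPLACE INTO searchindex VALUES (?, ?, ?)",
        (page_id, filename, decoder(data)),
    )
    return True


def search(db: sqlite3.Connection, term: str) -> List[str]:
    """Return titles whose indexed text contains the term (wildcards not escaped)."""
    rows = db.execute(
        "SELECT si_title FROM searchindex WHERE si_text LIKE ?", (f"%{term}%",)
    )
    return [row[0] for row in rows]
```

With this shape, supporting a new format means adding one entry to the decoder map; the upload and search code does not change.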
I haven't had time to look into this in detail, but maybe more knowledgeable folk here could comment on whether this strategy makes sense; or, indeed, whether something like this already exists or is in progress.
Ian
The approach suggested by Ian Smith below is the one adopted by a couple of systems I work with. For example, Microsoft SharePoint 2007 uses "iFilters" for each document type to extract indexable information. Mac OS X's "Spotlight" feature also has per-filetype "importers" for extracting indexable text.
So the concept is not unworkable, but it seems to me to be a stretch for MediaWiki. MW is a Wiki-Page Management System™ <grin>, at which it excels; it's not a very good Document Management System, which is where Dave Sigafoos is apparently being driven (perhaps in slow stages) by his users.
Perhaps Dave should investigate other document management approaches and a metasearch engine to search across multiple systems.
My $.02...
-- Joshua
Moin,
On Monday 09 April 2007 20:59:36 Joshua Yeidel wrote:
The approach suggested by Ian Smith below is the one adopted by a couple of systems I work with. For example, Microsoft SharePoint 2007 uses "iFilters" for each document type to extract indexable information. Mac OS X's "Spotlight" feature also has per-filetype "importers" for extracting indexable text.
So the concept is not unworkable, but it seems to me to be a stretch for MediaWiki. MW is a Wiki-Page Management System™ <grin>, at which it excels; it's not a very good Document Management System, which is where Dave Sigafoos is apparently being driven (perhaps in slow stages) by his users.
Perhaps Dave should investigate other document management approaches and a metasearch engine to search across multiple systems.
Well, or write an extension that implements his idea, e.g.:
* upon upload, run the document through an index-generator (per file type)
* add that index-text to some searchable index, or store it in the mysql
For each file type, you can do something like:
  pdftotext $pdf_file
  exif $image_file
  etc.
There are very very probably filters for doc, ppt, xls, etc. If not, one can always whip one up with a Perl module (I know there exist modules for office and excel). While these might not get all the formatting etc., they will be able to extract the bulk (if not all) of the text, and you can then easily index & search this text.
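As a rough sketch of the "one filter per file type" idea, here is how external command-line converters could be wrapped, written in Python. It assumes pdftotext and antiword are installed and print text to standard output when called as shown; the CONVERTERS table, the {file} placeholder and the extract_text name are made up for illustration:

```python
import subprocess
from pathlib import Path
from typing import Dict, List, Optional

# Hypothetical mapping from suffix to a command that writes plain text to
# stdout. Neither tool is guaranteed to be present on a given server.
CONVERTERS: Dict[str, List[str]] = {
    ".pdf": ["pdftotext", "{file}", "-"],
    ".doc": ["antiword", "{file}"],
}


def extract_text(
    path: str,
    converters: Dict[str, List[str]] = CONVERTERS,
    timeout: float = 30.0,
) -> Optional[str]:
    """Run the converter configured for this suffix; return None if there is
    no converter, it is not installed, it fails, or it times out."""
    command = converters.get(Path(path).suffix.lower())
    if command is None:
        return None
    argv = [arg.replace("{file}", str(path)) for arg in command]
    try:
        result = subprocess.run(argv, capture_output=True, timeout=timeout, check=True)
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the command as a list rather than a shell string avoids quoting problems with odd upload names, and the timeout keeps one broken document from stalling the upload.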
Another option would be to just let the webserver handle this, by running htdig (or google appliance?) over the uploaded files (which end up in wiki/images, anyway) and present the user with a search box to search all these files. The second option wouldn't integrate with mediawiki that nicely, tho.
All the best,
Tels
- -- Signed on Mon Apr 9 23:34:55 2007 with key 0x93B84C15. Get one of my photo posters: http://bloodgate.com/posters PGP key on http://bloodgate.com/tels.asc or per email.
"One man in a thousand is a leader of men, the other 999 follow women"
-- Groucho Marx
On 10/04/07, Tels nospam-abuse@bloodgate.com wrote:
Well, or write an extension that implements his idea, e.g.:
* upon upload, run the document through an index-generator (per file type)
* add that index-text to some searchable index, or store it in the mysql
I've thrown some very experimental initial work on such an extension into Subversion; trunk/extensions/FileSearch. It introduces a framework for defining "Extractors", which are capable of extracting indexable content from uploaded files, and adds this into the search index for the image page itself when a file is uploaded.
Searching seems to work (the reference implementation provided is for plain text file uploads, which are obviously straightforward), but result display leaves a lot to be desired. I anticipate rewriting the whole thing to provide cleaner, better-looking integration into MediaWiki at some point soonish.
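For readers who cannot reach Subversion, an extractor framework of this general kind might look like the sketch below. This is NOT the FileSearch code (which is PHP and which I have not reproduced); the class names, the register function and the HTML extractor are all hypothetical, written in Python purely to show how a plain-text reference extractor and a second format could plug into the same interface:

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser
from typing import Dict, List, Optional


class Extractor(ABC):
    """Turns the raw bytes of one kind of upload into indexable text."""

    extensions: List[str] = []

    @abstractmethod
    def extract(self, data: bytes) -> str:
        ...


class PlainTextExtractor(Extractor):
    """Reference case: plain text needs decoding only."""

    extensions = ["txt"]

    def extract(self, data: bytes) -> str:
        return data.decode("utf-8", errors="replace")


class _TextCollector(HTMLParser):
    """Collects character data and ignores tags (script/style not filtered)."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


class HtmlExtractor(Extractor):
    extensions = ["html", "htm"]

    def extract(self, data: bytes) -> str:
        collector = _TextCollector()
        collector.feed(data.decode("utf-8", errors="replace"))
        collector.close()
        # Collapse runs of whitespace left behind by the markup.
        return " ".join("".join(collector.parts).split())


_REGISTRY: Dict[str, Extractor] = {}


def register(extractor: Extractor) -> None:
    for ext in extractor.extensions:
        _REGISTRY[ext.lower()] = extractor


def text_for_upload(filename: str, data: bytes) -> Optional[str]:
    """Return indexable text, or None when no extractor handles the file."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    extractor = _REGISTRY.get(ext)
    return extractor.extract(data) if extractor else None
```

The text returned by text_for_upload would then be added to the search index entry for the file's description page at upload time.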
Rob Church
I've thrown some very experimental initial work on such an extension into Subversion; trunk/extensions/FileSearch.
Sounds great, Rob! Unfortunately, I can't access SVN at the moment*, so I'll just have to take your word for it now...
* I keep getting "Forbidden. You were denied access because: Access denied by access control list." for the SVN Web interface, as well as for Meta and images on Wikipedia. I'll try again from home tonight.
add that index-text to some searchable index, or store it in the mysql
Maybe I'm missing something here, but why not just throw that "index-text" into the file's description field?
-- F.
On 10/04/07, Frederik Dohr FDG001@gmx.net wrote:
Maybe I'm missing something here, but why not just throw that "index-text" into the file's description field?
Do you really want a 200kb Word document transcribed into the page itself?
- d.
Maybe I'm missing something here, but why not just throw that "index-text" into the file's description field?
Do you really want a 200kb Word document transcribed into the page itself?
It could always be enclosed in a "<DIV style='display: none;'>" wrapper...
However, loading times might indeed be an issue - though probably less so for intranet networks.
-- F.
Rob,
Thanks for all your hard work on this.
I took a look at the svn and was surprised at how *easy* it all looked. This tells me that there is a pretty good structure under it all for the MW engine.
Are there any documents, other than those found by searching MW, that describe the architecture of MW?
Thanks again. I look forward to being able to upload documents and not hear complaints <G>
DSig David Tod Sigafoos | SANMAR Corporation
-----Original Message----- From: mediawiki-l-bounces@lists.wikimedia.org [mailto:mediawiki-l-bounces@lists.wikimedia.org] On Behalf Of Rob Church Sent: Tuesday, April 10, 2007 2:14 To: MediaWiki announcements and site admin list Subject: Re: [Mediawiki-l] Storing or Linking Documents
On 10/04/07, Tels nospam-abuse@bloodgate.com wrote:
Well, or write an extension that implements his idea, e.g.:
* upon upload, run the document through an index-generator
(per file type)
* add that index-text to some searchable index, or store it in
the mysql
I've thrown some very experimental initial work on such an extension into Subversion; trunk/extensions/FileSearch. It introduces a framework for defining "Extractors", which are capable of extracting indexable content from uploaded files, and adds this into the search index for the image page itself when a file is uploaded.
Searching seems to work (the reference implementation provided is for plain text file uploads, which are obviously straightforward), but result display leaves a lot to be desired. I anticipate rewriting the whole thing to provide cleaner, better-looking integration into MediaWiki at some point soonish.
Rob Church
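To make the "Extractor" idea in the message above concrete, here is a minimal sketch in Python. It is not the FileSearch code (which is a MediaWiki extension in Subversion); the class and method names (`Extractor`, `PlainTextExtractor`, `FileSearchIndex`, `on_upload`) are assumptions made for illustration, and the "search index" is just an in-memory dictionary.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class Extractor(ABC):
    """Hypothetical interface: turns an uploaded file's bytes into indexable text."""

    # File extensions (without the dot) this extractor claims to handle.
    extensions: Tuple[str, ...] = ()

    @abstractmethod
    def extract(self, data: bytes) -> str:
        ...


class PlainTextExtractor(Extractor):
    """Reference case: plain text uploads need decoding only."""

    extensions = ("txt",)

    def extract(self, data: bytes) -> str:
        return data.decode("utf-8", errors="replace")


class FileSearchIndex:
    """Toy stand-in for the search index, keyed by image page title."""

    def __init__(self) -> None:
        self._extractors: Dict[str, Extractor] = {}
        self._index: Dict[str, str] = {}

    def register(self, extractor: Extractor) -> None:
        for ext in extractor.extensions:
            self._extractors[ext.lower()] = extractor

    def on_upload(self, filename: str, data: bytes) -> bool:
        """Index the file's text under its Image: page; False if no extractor fits."""
        ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
        extractor = self._extractors.get(ext)
        if extractor is None:
            return False
        self._index["Image:" + filename] = extractor.extract(data)
        return True

    def search(self, term: str) -> List[str]:
        """Return the titles whose extracted text contains the term (case-insensitive)."""
        needle = term.lower()
        return sorted(t for t, text in self._index.items() if needle in text.lower())


index = FileSearchIndex()
index.register(PlainTextExtractor())
index.on_upload("notes.txt", b"Budget figures for April")
print(index.search("budget"))
```

Supporting another format would then mean writing one more `Extractor` subclass and registering it, without touching the upload or search paths.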
On 09/04/07, Ian Smith ismith@good.com wrote:
So, in an attempt to take the heat out of this and get to facts, what I think you're looking for is an extension to:
- allow admins to configure decoders for specific document types
- run the right decoder (if any) when a document is uploaded
- add the resulting plain text to the "searchindex" table.
You would then have to find, install and configure decoders for your most-used document types.
Sounds good. e.g. Antiword is a quick way to turn a Word document into indexable text.
Now all MediaWiki needs is a text search that doesn't suck ;-p
- d.
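The three steps listed above (admin-configured decoders per document type, running the matching decoder at upload, adding the plain text to the search index) can be sketched as follows. This is an illustration under assumptions, not an existing extension: `DECODERS`, `decode_to_text` and `index_upload` are hypothetical names, the index is a plain dictionary rather than the "searchindex" table, and the `.doc` entry assumes antiword is installed and on the PATH.

```python
import subprocess
from pathlib import Path
from typing import Dict, List, Optional

# Hypothetical admin configuration: file suffix -> command that writes the
# document's plain text to standard output. "{path}" is replaced by the
# uploaded file's path.
DECODERS: Dict[str, List[str]] = {
    ".doc": ["antiword", "{path}"],
}


def decode_to_text(
    path: str,
    decoders: Dict[str, List[str]] = DECODERS,
    timeout: float = 30.0,
) -> Optional[str]:
    """Run the decoder configured for this suffix; None if there is none."""
    template = decoders.get(Path(path).suffix.lower())
    if template is None:
        return None
    command = [part.replace("{path}", str(path)) for part in template]
    # check=True raises CalledProcessError if the decoder exits with an error,
    # so a failed conversion is not silently indexed as empty text.
    result = subprocess.run(command, capture_output=True, timeout=timeout, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def index_upload(
    path: str,
    search_index: Dict[str, str],
    decoders: Dict[str, List[str]] = DECODERS,
) -> bool:
    """Add the decoded text to the index under the file name, if a decoder exists."""
    text = decode_to_text(path, decoders)
    if text is None:
        return False
    search_index[Path(path).name] = text
    return True
```

Passing the command as a list rather than a shell string keeps uploaded file names from being interpreted by a shell, which matters when the names come from users.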
David Gerard wrote:
Ian Smith wrote:
So, in an attempt to take the heat out of this and get to facts, what I think you're looking for is an extension to:
- allow admins to configure decoders for specific document types
- run the right decoder (if any) when a document is uploaded
- add the resulting plain text to the "searchindex" table.
You would then have to find, install and configure decoders for your most-used document types.
Sounds good. e.g. Antiword is a quick way to turn a Word document into indexable text.
AbiWord --to=txt
On 10/04/07, Platonides Platonides@gmail.com wrote:
David Gerard wrote:
Sounds good. e.g. Antiword is a quick way to turn a Word document into indexable text.
AbiWord --to=txt http://wvware.sourceforge.net/
There's lots of ways, yes :-) Something to extract indexable text from any document there's a filter for, and feed it to the indexer. That'd be just what we need to index Word documents added to a MediaWiki. Does anything like this exist already, or is it a Simple Matter Of Programming?
- d.
On 10/04/07, David Gerard dgerard@gmail.com wrote:
There's lots of ways, yes :-) Something to extract indexable text from any document there's a filter for, and feed it to the indexer. That'd be just what we need to index Word documents added to a MediaWiki. Does anything like this exist already, or is it a Simple Matter Of Programming?
http://lists.wikimedia.org/pipermail/mediawiki-l/2007-April/019490.html
http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/FileSearch/
Rob Church
Dave Sigafoos wrote:
Unless MS removed every 'text' word from their document I don't see where an extension that could index the words would be a moving target.
I have a lot of old documents in IBM's Bookmanager format. They are encrypted in such a way that no one can scan past the formatting info and find the text. I rely on some old, buggy Bookmanager software to access them and expect that with another change of OS version I will lose the ability to use them. IBM has never released the internal format specification for the documents and nothing I've been able to do has wrested the info from them.
I keep expecting MS to pull a similar stunt with Word. They could sell the encryption as a "security" feature.
Do you think that MS gives a crap whether MW is or is not capable of indexing Word documents? Unless, of course, you feel able to call Bill and have him change his format.
MS might give a crap about the laws currently being pushed out that force open standards for document storage. These governments don't want to have docs stored that become obsolete because of one vendor's decision to change their format. This was what I was thinking of when I mentioned saving in an open standard. Of course, the battle right now is MS's version of its "open standard" versus the open source community's desire for a truly open standard.
It's been in the computer news so much lately I thought you'd get the drift of the comments. I guess I shouldn't have been so obscure. Sorry!
I believe that the market place will decide, rightly or wrongly (and who decides that?), what tools and "standards" will be used.
It looks like the elected reps will beat the market to it. That of course brings its own risks/rewards.
Of course right now it doesn't really matter as the only TYPE is IMAGE and MW doesn't appear to be able to search on it (which makes sense if the only type I wanted to store was IMAGE).
Searching on images is a major problem. Do you search only on names of images, on descriptions or on content? Searching content is still a significant research effort in image processing and recognition.
The only restriction on what you can upload is in the extension list in LocalSettings.php. My wiki allows specific text uploads. There's an extension I'm working on that processes them (mod of an existing extension). I don't see any reason why you couldn't make an extension that does the same with any chosen doc format.
Mike