Hi,
I'm a wikipedian and have helped on a few articles. I've noticed the new categorization feature and have a suggestion.
I happen to have professional experience with automated categorization software, and think you could use it to automatically populate categories with relevant articles.
The basic approach:
- Prepare a dataset.
  - Specify a taxonomy to be learned, e.g. the current category taxonomy.
  - Populate the taxonomy with example documents, e.g. a sample of existing articles from each category.
- "Learn a model" using a statistical text classifier.
  - This requires a fair amount of server time, say a day or so on a beefy server with lots of RAM.
- Evaluate the results and select the well-modelled categories.
- Set up a classification server that funnels all Wikipedia articles through the selected categories.
  - Those articles/nodes that are a strong match for a category get annotated with a special tag showing that an automated categorization has been made, and a link is added to the article page back to the category.
  - The category lists all such articles/nodes, sorted by confidence, size, etc.
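To make the steps above concrete, here is a minimal sketch of the idea in Python. It is not ReelTwo's system: it assumes a plain multinomial naive Bayes classifier with add-one smoothing, and the function names, the toy categories and the 0.8 confidence threshold are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """examples: iterable of (category, article_text) pairs.

    Returns (word_counts, doc_counts), the "learned model".
    """
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for category, text in examples:
        doc_counts[category] += 1
        word_counts[category].update(tokenize(text))
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return {category: posterior probability} for one article."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    words = [w for w in tokenize(text) if w in vocab]
    log_scores = {}
    for category, n_docs in doc_counts.items():
        counts = word_counts[category]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(n_docs / total_docs)
        for word in words:
            score += math.log((counts[word] + 1) / denom)
        log_scores[category] = score
    # Normalise in a numerically stable way.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(weights.values())
    return {c: w / norm for c, w in weights.items()}


def strong_matches(text, model, threshold=0.8):
    """Categories whose posterior clears the (hypothetical) threshold,
    sorted by confidence, highest first."""
    posteriors = classify(text, *model)
    hits = [(c, p) for c, p in posteriors.items() if p >= threshold]
    return sorted(hits, key=lambda cp: -cp[1])


# Toy training data standing in for "a sample of existing articles".
EXAMPLES = [
    ("Chemistry", "molecules atoms bond reaction"),
    ("Chemistry", "acid base reaction molecules"),
    ("Zoology", "wolf pack predator mammal"),
    ("Zoology", "mammal fur predator habitat"),
]
```

Calling `strong_matches("the reaction between molecules", train(EXAMPLES))` would return only the Chemistry category here. On real data, the "evaluate the results" step would decide which categories are modelled well enough to be trusted at a given threshold.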
If this sounds interesting, I may be able to get a donated copy of ReelTwo's classification system:
http://reeltwo.com/products.html
for permanent use in the wikipedia. It's very high performance.
ReelTwo is my old company. It's a small datamining shop. The staff are all supporters of OSS and enjoy using wikipedia, and I think they'd appreciate the opportunity to contribute to wikipedia, and also be proud to be associated. As would I.
Please let me know if you have any questions.
Cheers, Pablo Mayrgundter freality.org/~pablo
On Wed, 17 Nov 2004 23:36:13 +0000 (UTC), Pablo Mayrgundter pablo@freality.com wrote:
I happen to have professional experience with automated categorization software, and think you could use it to automatically populate categories with relevant articles.
[details removed for convenience]
It's certainly an interesting idea. (And as a graduate in artificial intelligence, I'm personally intrigued as to what kind of classifier is under the hood, but that would be off-topic, so we'll leave that for another time). I think the biggest concern would be that the Wikimedia Foundation itself really can't afford that kind of server resources to carry out the task "officially".
I would therefore suggest it would be better for you to run the software yourself (using a database dump from http://download.wikimedia.org/) and then to publish the results to the wiki(s) in question somehow. Perhaps you could write and run a bot (or collaborate with someone who wanted to) which could add the automated suggestions to relevant talk pages: "The following may belong here..." to Category_talk: pages, and/or "This may belong to the following categories..." to the Talk: pages of articles. Or even, if you felt particularly dedicated, a bot that helped you add the [[Category:...]] tag yourself where it was indeed appropriate.
[Note that you should read http://en.wikipedia.org/wiki/Wikipedia:Bots for the official policy on bot use in [English?] Wikipedia.]
Good luck...
On Fri, 19 Nov 2004 00:11:35 +0000, Rowan Collins rowan.collins@gmail.com wrote:
It's certainly an interesting idea. (And as a graduate in artificial intelligence, I'm personally intrigued as to what kind of classifier is under the hood, but that would be off-topic, so we'll leave that for another time).
Apart from a simple textual analysis (an article with the word "molecules" is quite likely to be about chemistry, an article with the word "wolf" much less so), the existing wiki-links (both forward and backward) give many hints - if a page links to or is linked from several pages already in my set, that increases the chance that it belongs there too.
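A rough sketch of the link hint, in Python. It assumes the link graph has already been extracted from a dump into a dictionary; the function name and the data layout are hypothetical.

```python
def link_scores(links, seed):
    """Score pages outside the seed set by their links to and from it.

    links: {page_title: set of page titles it links to}
    seed:  set of page titles already in the category
    Returns {candidate: number of links connecting it to seed pages,
    counting both directions}.
    """
    scores = {}
    for page, targets in links.items():
        for target in targets:
            if page in seed and target not in seed:
                # Forward link: a seed page links to the candidate.
                scores[target] = scores.get(target, 0) + 1
            elif target in seed and page not in seed:
                # Backward link: the candidate links to a seed page.
                scores[page] = scores.get(page, 0) + 1
    return scores


# Toy link graph, purely illustrative.
LINKS = {
    "Benzene": {"Molecule", "Carbon"},
    "Ethanol": {"Molecule", "Alcohol"},
    "Molecule": {"Benzene"},
    "Wolf": {"Mammal"},
}
```

With the seed set `{"Benzene", "Ethanol"}`, "Molecule" gets the highest score because it is connected to the set three times, while "Wolf" and "Mammal" do not appear at all. Such a score could be combined with a textual one, but how to weight the two is left open here.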
I would therefore suggest it would be better for you to run the software yourself (using a database dump from http://download.wikimedia.org/) and then to publish the results to the wiki(s) in question somehow. Perhaps you could write and run a bot (or collaborate with someone who wanted to) which could add the automated suggestions to relevant talk pages: "The following may belong here..." to Category_talk: pages, and/or "This may belong to the following categories..." to the Talk: pages of articles. Or even, if you felt particularly dedicated, a bot that helped you add the [[Category:...]] tag yourself where it was indeed appropriate.
This last one seems like the way to go - a bot that goes through one page after another and asks you a yes/no question on whether to include it. This can be done quite fast, much faster than populating a category by hand, and the answers already given can of course be fed to a learning algorithm to sort out later candidates even better (although doing so would probably slow the bot down considerably, so perhaps it's better to get a start on the 'include' and 'exclude' categories by hand, then go offline to make the estimates, and after that do the real work in the way described above).
In fact I have already made such a bot, using the simple algorithm of asking about every page that links to or is linked from a page already included, in order to build up a list. However, this was before Categories were implemented, and I have not yet reprogrammed it for that purpose.
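The loop described above might look roughly like the following Python sketch. It is not the actual bot: the link graph is assumed to be an in-memory dictionary, and `ask` is a hypothetical callback standing in for the human yes/no answer.

```python
from collections import deque


def review_candidates(links, seed, ask):
    """Grow a page set by asking about neighbours of included pages.

    links: {page_title: set of page titles it links to}
    seed:  pages included to start with
    ask:   callable taking a page title and returning True (include)
           or False (exclude)
    Returns (included, rejected) as two sets.
    """
    included = set(seed)
    rejected = set()
    seen = set(seed)          # pages already included or queued
    queue = deque()

    def neighbours(page):
        # Pages linked from `page` plus pages linking to `page`.
        out = set(links.get(page, ()))
        out.update(p for p, targets in links.items() if page in targets)
        return out

    def enqueue_neighbours(page):
        for other in sorted(neighbours(page)):
            if other not in seen:
                seen.add(other)
                queue.append(other)

    for page in sorted(included):
        enqueue_neighbours(page)

    while queue:
        candidate = queue.popleft()
        if ask(candidate):
            included.add(candidate)
            # Only accepted pages contribute new candidates.
            enqueue_neighbours(candidate)
        else:
            rejected.add(candidate)
    return included, rejected
```

In interactive use, `ask` could be something like `lambda title: input(title + "? [y/n] ") == "y"`. The returned `rejected` set is the kind of record that could later be fed back into a classifier as negative examples.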
Andre Engels
On Fri, 19 Nov 2004 17:58:57 +0100, Andre Engels andreengels@gmail.com wrote:
...the answers already given can of course be fed to a learning algorithm to sort out later candidates even better (although doing so would probably slow the bot down considerably, so perhaps it's better to get a start on the 'include' and 'exclude' categories by hand, then go offline to make the estimates, and after that do the real work in the way described above).
Well, given that both the content of articles and, more importantly, the hierarchy of categories, will change over time, I'd have thought it would be necessary/useful to re-run the entire classifier algorithm every now and then anyway; so perhaps a log of human 'rejected'/'approved' responses from running the bot could be fed back in as an extra data source somehow...
Rowan Collins wrote:
On Fri, 19 Nov 2004 17:58:57 +0100, Andre Engels andreengels@gmail.com wrote:
...the answers already given can of course be fed to a learning algorithm to sort out later candidates even better (although doing so would probably slow the bot down considerably, so perhaps it's better to get a start on the 'include' and 'exclude' categories by hand, then go offline to make the estimates, and after that do the real work in the way described above).
Well, given that both the content of articles and, more importantly, the hierarchy of categories, will change over time, I'd have thought it would be necessary/useful to re-run the entire classifier algorithm every now and then anyway; so perhaps a log of human 'rejected'/'approved' responses from running the bot could be fed back in as an extra data source somehow...
Before we consider automatic classification, we should first use all the information in the various human-generated lists. Many of these map directly to categories, and could be used to auto-categorise articles with no AI effort needed. I suggest a concerted bot-aided effort to data-mine the existing list information would be very useful.
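As a sketch of what mining a list page could involve, the Python below pulls the first wiki-link out of each bullet line of a list's wikitext and groups the results under a target category. The regular expression is a crude assumption about list markup, the skipping of titles containing ":" is a rough way to ignore Image: and Category: links, and the list-to-category mapping is hypothetical and would need human review.

```python
import re

# Matches [[Title]], [[Title|label]] and [[Title#section]]; captures Title.
LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def articles_in_list(wikitext):
    """Return the first linked title of each bullet line, in order,
    without duplicates."""
    titles = []
    for line in wikitext.splitlines():
        if not line.startswith("*"):
            continue  # only bullet items are treated as list entries
        match = LINK_RE.search(line)
        if not match:
            continue
        title = match.group(1).strip()
        if ":" in title or title in titles:
            continue  # skip namespaced links and repeats
        titles.append(title)
    return titles


def category_suggestions(list_pages, list_to_category):
    """list_pages: {list_title: wikitext}
    list_to_category: {list_title: category_name}, a hand-made mapping.
    Returns {category_name: [article titles suggested for it]}.
    """
    suggestions = {}
    for list_title, wikitext in list_pages.items():
        category = list_to_category.get(list_title)
        if category is None:
            continue  # no agreed mapping for this list
        found = suggestions.setdefault(category, [])
        for title in articles_in_list(wikitext):
            if title not in found:
                found.append(title)
    return suggestions
```

The output could serve both as direct categorization suggestions and as extra seed data for testing any classifier.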
The more genuine classification seed-information we have as input, the better AI algorithms will be able to work. None of this need hold up efforts on developing auto-categorization algorithms; but it will provide a much better dataset for testing them, thus hopefully greatly enhancing their performance when they are rolled out.
Perhaps we should have an article-categorization developers group?
-- Neil
wikitech-l@lists.wikimedia.org