Hi,
I'm a wikipedian and have helped on a few articles. I've noticed the new categorization feature and have a suggestion.
I happen to have professional experience with automated categorization software, and think such software could be used to automatically populate categories with relevant articles.
The basic approach:
- Prepare a dataset
  - Specify a taxonomy to be learned, e.g. the current category taxonomy.
  - Populate the taxonomy with example documents, e.g. a sample of existing articles from each category.
- "Learn a model" using a statistical text classifier.
  - This requires a fair amount of server time, say a day or so on a beefy server with lots of RAM.
- Evaluate the results and select well-modelled categories.
- Set up a classification server that funnels all wikipedia articles through the selected categories.
  - Those articles/nodes that are a strong match for a category get annotated with a special tag showing that an automated categorization has been made, and a link is added to the article page back to the category.
  - The category lists all such articles/nodes, sorted by confidence, size, etc.
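To make the steps above concrete, here is a minimal sketch in Python. It is not ReelTwo's system: a small multinomial naive Bayes classifier stands in for "a statistical text classifier", and every name in it (NaiveBayes, select_categories, categorize, the 0.9 and 0.8 thresholds) is hypothetical. The category names and article texts are made-up toy data, only there to show the flow from training, through evaluation, to a confidence-sorted category listing.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and keep runs of ASCII letters only.
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, examples):
        # examples: list of (category, text) pairs, i.e. the populated taxonomy.
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter()
        self.vocab = set()
        for category, text in examples:
            tokens = tokenize(text)
            self.word_counts[category].update(tokens)
            self.doc_counts[category] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Returns (best_category, confidence), confidence being the posterior.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            log_scores[category] = score
        # Normalize log scores into probabilities that sum to 1.
        top = max(log_scores.values())
        probs = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(probs.values())
        best = max(probs, key=probs.get)
        return best, probs[best] / norm


def select_categories(model, held_out, min_accuracy=0.9):
    # Keep only categories the model predicts well on held-out examples.
    correct, seen = Counter(), Counter()
    for category, text in held_out:
        seen[category] += 1
        predicted, _ = model.predict(text)
        correct[category] += predicted == category
    return {c for c in seen if correct[c] / seen[c] >= min_accuracy}


def categorize(model, articles, selected, threshold=0.8):
    # Attach strong matches to their category, sorted by confidence.
    by_category = defaultdict(list)
    for title, text in articles.items():
        category, confidence = model.predict(text)
        if category in selected and confidence >= threshold:
            by_category[category].append((title, confidence))
    for matches in by_category.values():
        matches.sort(key=lambda m: m[1], reverse=True)
    return dict(by_category)


# Toy data (invented for illustration only).
TRAIN = [
    ("Astronomy", "the telescope observed a distant galaxy and its stars"),
    ("Astronomy", "planets orbit the star in the solar system"),
    ("Cooking", "simmer the sauce and season the soup with salt"),
    ("Cooking", "bake the bread in a hot oven with butter"),
]
HELD_OUT = [
    ("Astronomy", "a galaxy contains billions of stars"),
    ("Cooking", "season the bread with salt and butter"),
]
ARTICLES = {
    "Andromeda": "the andromeda galaxy is a spiral galaxy of many stars",
    "Sourdough": "sourdough bread with butter from a hot oven",
    "Chess": "chess is a board game for two players",
}

model = NaiveBayes().fit(TRAIN)
selected = select_categories(model, HELD_OUT)
listing = categorize(model, ARTICLES, selected)
for category, matches in sorted(listing.items()):
    for title, confidence in matches:
        print(f"{category}: {title} ({confidence:.2f})")
```

In this toy run the "Chess" article shares almost no vocabulary with either category, so its confidence stays near 0.5 and it is left uncategorized. That is the behaviour the threshold is meant to give: only strong matches get the automated tag.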
If this sounds interesting, I may be able to get a donated copy of ReelTwo's classification system:
http://reeltwo.com/products.html
for permanent use in the wikipedia. It's very high performance.
ReelTwo is my old company. It's a small datamining shop. The staff are all supporters of OSS and enjoy using wikipedia, and I think they'd appreciate the opportunity to contribute to wikipedia, and also be proud to be associated. As would I.
Please let me know if you have any questions.
Cheers,
Pablo Mayrgundter
freality.org/~pablo
wikitech-l@lists.wikimedia.org