Hello,
I read the thread "how bad is a category with ....", and I was wondering how categories are filled. If I understand correctly, categories are assigned by the editors of each article. Does this assume that these editors know the whole set of categories and that these categories will not change over time? I was wondering whether there are projects to help *detect* categories and then to help editors by *suggesting* categories.
I am thinking of two different techniques to help with these two problems:
1) Text clustering to help find categories, but probably not with the classical approach in which a document is described in word space (after applying part-of-speech tagging http://en.wikipedia.org/wiki/Part-of-speech_tagging, stemming http://en.wikipedia.org/wiki/Stemmer, ...). I am thinking instead of clustering the link graph (this seems similar to the clique problem http://en.wikipedia.org/wiki/Clique_problem but with different constraints), i.e. each document would be described not by its words (or lemmas, LSA vector, ...) but by its links to other articles, using an algorithm that does not need the number of clusters in advance but does need a distance or similarity threshold. With this kind of processing, you get clusters of articles that are linked together, but a cluster will probably not be a complete graph (this is the difference from the clique problem). Once you have the clusters, you need to try to label each of them with a category:
- let the user identify the category name
- use the word space to find the words that best describe this set of articles
- ...
You can then run this algorithm on a single category to try to split it into subcategories.
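To make idea 1) a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not an existing Wikimedia tool: each article is described by the set of articles it links to, two articles are joined when the Jaccard similarity of their link sets reaches a threshold, and the clusters are the connected components of that relation. The function names, the choice of Jaccard similarity and the sample data are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of link targets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_by_links(links, threshold):
    """Group articles whose outgoing link sets are similar enough.

    links: dict mapping an article title to the set of titles it links to.
    Two articles are joined when their similarity is >= threshold.
    Clusters are the connected components of that relation, so the
    number of clusters is not fixed in advance and a cluster does not
    have to be a complete graph.
    """
    parent = {title: title for title in links}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Pairwise comparison: fine for a sketch, too slow for a full wiki.
    for a, b in combinations(links, 2):
        if jaccard(links[a], links[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for title in links:
        clusters.setdefault(find(title), []).append(title)
    return sorted(sorted(members) for members in clusters.values())


# Made-up sample data, for illustration only.
sample_links = {
    "Paris": {"France", "Seine", "Europe"},
    "Lyon": {"France", "Rhone", "Europe"},
    "Marseille": {"France", "Mediterranean", "Europe"},
    "Python": {"Programming language", "Guido van Rossum"},
    "Ruby": {"Programming language", "Yukihiro Matsumoto"},
}
print(cluster_by_links(sample_links, threshold=0.3))
```

Raising the threshold splits clusters into smaller ones, which is roughly what running the algorithm again inside one category to find subcategories would look like.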
2) Machine learning or link graph exploration to suggest categories while an article is being edited. The first idea is to learn the existing categories with a machine learning algorithm (using word space) in order to guess the categories of a new article (but this algorithm would have to deal with new categories and with the fact that the number of documents without a category is greater than the number of documents with one). The second idea is much simpler and easier to implement: when you edit an article, the categories of the linked articles can be suggested (this can be replaced by another graph-exploration algorithm).
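The second idea in 2) can be sketched in a few lines of Python. This is only a rough illustration under my own assumptions: the function name, the ranking by frequency and the sample data are hypothetical, and the two dictionaries stand in for whatever the wiki database would actually provide.

```python
from collections import Counter


def suggest_categories(article, links, categories, top_n=5):
    """Suggest categories for an article from the articles it links to.

    links: dict mapping an article title to the titles it links to.
    categories: dict mapping an article title to its set of categories.
    Categories are ranked by how many linked articles carry them;
    categories the article already has are not suggested again.
    """
    counts = Counter()
    for target in links.get(article, ()):
        counts.update(categories.get(target, ()))
    for existing in categories.get(article, ()):
        counts.pop(existing, None)
    # Most frequent first, ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


# Made-up sample data, for illustration only.
sample_links = {"Lyon": ["Paris", "Marseille", "Rhone"]}
sample_categories = {
    "Paris": {"Cities in France", "Capitals in Europe"},
    "Marseille": {"Cities in France", "Port cities"},
    "Rhone": {"Rivers of France"},
}
print(suggest_categories("Lyon", sample_links, sample_categories, top_n=2))
```

The editor would still choose among the suggestions; the link graph only narrows the list of candidates.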
Are there functions like these in Wikimedia? And do you think that this kind of algorithm could help? Finally, do you know of people working on such functionality (maybe people working on the semantic web)?
Best Regards. Julien Lemoine