Hi, Amir,

I have some experience with topic modeling, though this may not be a direct answer.

The most widely adopted techniques for modeling the topics of documents are LDA[1] and LSI[2].

Under these techniques, a document is viewed as a mixture of topics, while a topic is a mixture of words.
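
To make the "mixture" view concrete, here is a minimal Python sketch. The topics, words, and probabilities are invented for illustration and are not the output of a trained LDA or LSI model; the function name `word_probability` is likewise just an assumption for this example.

```python
# Toy numbers only: in LDA these distributions are learned from a corpus.

# Each topic is a probability distribution over words.
topic_word = {
    "sports": {"match": 0.5, "team": 0.4, "vote": 0.1},
    "politics": {"match": 0.1, "team": 0.1, "vote": 0.8},
}

# Each document is a probability distribution over topics.
doc_topic = {"sports": 0.75, "politics": 0.25}


def word_probability(word, doc_topic, topic_word):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in doc_topic.items()
    )


for word in ("match", "team", "vote"):
    print(word, round(word_probability(word, doc_topic, topic_word), 3))
```

Fitting a topic model amounts to estimating both tables from the documents rather than writing them by hand.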

Both methods have good implementations in several languages, for example gensim[3] in Python.

But these methods are relatively expensive to compute.

Last year, Google introduced a word vector model, word2vec[4].

By combining it with a topic catalog, we can easily decide which topic an article belongs to.

The topic catalog is just a list of topics, where each topic is a list of related words.
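
One possible way to combine word vectors with such a catalog is sketched below. This is only an illustration under stated assumptions, not a description of how simbase or opentopics work: the three-dimensional vectors are hand-made stand-ins for vectors a word2vec model would learn, the catalog entries are hypothetical, and averaging vectors and comparing them by cosine similarity is just one simple choice.

```python
import math

# Hypothetical 3-dimensional vectors standing in for word2vec output;
# real vectors come from a trained model and have hundreds of dimensions.
word_vectors = {
    "born": [0.9, 0.1, 0.0],
    "died": [0.8, 0.2, 0.1],
    "career": [0.7, 0.1, 0.2],
    "river": [0.1, 0.9, 0.1],
    "city": [0.2, 0.8, 0.0],
    "population": [0.1, 0.7, 0.3],
    "species": [0.0, 0.2, 0.9],
    "habitat": [0.1, 0.3, 0.8],
}

# Hypothetical topic catalog: each topic is a list of related words.
topic_catalog = {
    "biography": ["born", "died", "career"],
    "place": ["river", "city", "population"],
    "species": ["species", "habitat"],
}


def mean_vector(words, vectors):
    """Average the vectors of the known words; None if no word is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(article_words, catalog, vectors):
    """Return the catalog topic whose mean vector is closest to the article's."""
    article_vec = mean_vector(article_words, vectors)
    if article_vec is None:
        return None
    scores = {}
    for topic, words in catalog.items():
        topic_vec = mean_vector(words, vectors)
        if topic_vec is not None:
            scores[topic] = cosine(article_vec, topic_vec)
    return max(scores, key=scores.get) if scores else None


print(classify(["the", "city", "lies", "on", "a", "river"], topic_catalog, word_vectors))
```

Words missing from the vector table are simply skipped, so an article with no known words gets no topic rather than an arbitrary one.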

We have released one open-source project in this direction:

* https://github.com/guokr/simbase

And another project is planned for the topic catalog:

* https://github.com/guokr/opentopics

We will update the catalog in the coming weeks and give more details.

[1] https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

[2] https://en.wikipedia.org/wiki/Latent_semantic_indexing

[3] https://en.wikipedia.org/wiki/Gensim

[4] https://code.google.com/p/word2vec/

On Mon, Mar 17, 2014 at 11:21 PM, Amir E. Aharoni <amir.aharoni@mail.huji.ac.il> wrote:

Hallo,

Is there any known easy way to classify Wikipedia articles into a relatively small number of types?

By "relatively small" I mean no more than twenty, and by "types" I mean things that are intuitively clear to readers, for example:

* Biographies
* Articles about scientific phenomena (can be sub-grouped to math, astronomy, physics, geology, medicine)
* Articles about works of art (paintings, movies, books, records, statues)
* Articles about places
* Articles about historical events
* Articles about biological species
* Articles that mostly present data, such as demography or results of competitions (sports, elections, game shows)

There are a few more, but not many. I hope that you get the idea.

We have categories, but I'm not sure that it's easy to use categories for such things because of the very loose category structure. For example, [[Eurovision 2007]] is somewhere under [[Category:Humans]], even though it's not an article about a human.

Such information can be useful for studying the types of articles that different people write. In particular, I thought about it in the context of analyzing the types of articles that people are translating now (manually) and will translate in the future using ContentTranslation, which is in its early stages of development.

Thanks,

--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
‪“We're living in pieces,
I want to live in peace.” – T. Moore‬

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l