Hallo,
Is there any known easy way to classify Wikipedia articles into a relatively small number of types?
By "relatively small" I mean no more than twenty, and by "types" I mean things that are intuitively clear to readers, for example:
- Biographies
- Articles about scientific phenomena (can be sub-grouped into math, astronomy, physics, geology, medicine)
- Articles about works of art (paintings, movies, books, records, statues)
- Articles about places
- Articles about historical events
- Articles about biological species
- Articles that mostly present data, such as demography or results of competitions (sports, elections, game shows)
There are a few more, but not many. I hope that you get the idea.
We have categories, but I'm not sure that it's easy to use categories for such things because of the very loose category structure. For example, [[Eurovision 2007]] is somewhere under [[Category:Humans]], even though it's not an article about a human.
Such information can be useful for studying the types of articles that different people write. In particular, I thought about it in the context of analyzing the types of articles that people are translating now (manually) and will translate in the future using ContentTranslation, which is in its early stages of development.
Thanks,
-- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com “We're living in pieces, I want to live in peace.” – T. Moore
Amir E. Aharoni, 17/03/2014 16:21:
Is there any known easy way to classify Wikipedia articles into a relatively small number of types?
By "relatively small" I mean no more than twenty, and by "types" I mean things that are intuitively clear to readers [...]
Your examples don't really seem like "topics" to me, but as far as I remember there have been several papers on how to classify articles by topic purely through statistical analysis of the words they contain. See for example http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4624691 and also http://dl.acm.org/citation.cfm?id=1620887 (I only read the abstracts).
Nemo
I recommend Wikidata for that. Maybe not quite as complete as Wikipedia categories yet, but much more precise.
[BEGIN SHAMELESS PLUG]
And you can make live, complex queries: http://wikidata-wdq-mm.instance-proxy.wmflabs.org/wdq/
[END SHAMELESS PLUG]
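As a rough illustration of the Wikidata approach, the sketch below maps an item's "instance of" (P31) values to a handful of coarse types. The class IDs are Wikidata items (Q5 human, Q16521 taxon, Q11424 film, Q3305213 painting, Q515 city), but the TYPE_BY_CLASS table, the type labels and the classify function are made up for this example, and the P31 values are assumed to have been fetched already. A real version would probably also need to follow "subclass of" chains rather than match classes directly.

```python
# Hypothetical mapping from Wikidata classes (values of P31, "instance of")
# to coarse article types. The table is illustrative, not exhaustive.
TYPE_BY_CLASS = {
    "Q5": "biography",          # human
    "Q16521": "species",        # taxon
    "Q11424": "work of art",    # film
    "Q3305213": "work of art",  # painting
    "Q515": "place",            # city
}


def classify(instance_of):
    """Return the coarse type for a list of P31 values, or "other"."""
    for class_id in instance_of:
        if class_id in TYPE_BY_CLASS:
            return TYPE_BY_CLASS[class_id]
    return "other"


# P31 values as they might be read from a dump or a query result (assumed input).
print(classify(["Q5"]))          # an item about a person
print(classify(["Q11424"]))      # an item about a film
print(classify(["Q999999999"]))  # a class the table does not know
```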
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Hi, Amir,
I have some experience with topic modeling, though this may not be a direct answer.
The most widely adopted techniques for modeling the topics of documents are LDA[1] and LSI[2]. Under these techniques, a document is viewed as a mixture of topics, and a topic as a mixture of words. Both methods are well implemented in several languages, for example by gensim[3] in Python. But these methods are relatively expensive to compute.
Last year Google introduced a word vector model, word2vec[4]. By combining it with a topic catalog, we can easily decide which topic an article belongs to. The topic catalog is just a list of topics, and each topic is a list of related words.
We released one open-source project in this direction:
* https://github.com/guokr/simbase
And another project on the topic catalog is planned:
* https://github.com/guokr/opentopics
We will update the catalog in the coming weeks and give more details.
[1] https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
[2] https://en.wikipedia.org/wiki/Latent_semantic_indexing
[3] https://en.wikipedia.org/wiki/Gensim
[4] https://code.google.com/p/word2vec/
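To make the "word vectors plus topic catalog" idea a bit more concrete, here is a minimal sketch. It is not the code of either project mentioned above. The three-dimensional vectors, the catalog entries and the function names are all invented for illustration; a trained word2vec model would supply real vectors with far more dimensions. The article and each topic are represented by the mean of their word vectors, and the topic with the highest cosine similarity wins.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values; a real model such as
# word2vec would be trained on a corpus and have hundreds of dimensions).
VECTORS = {
    "born": (0.9, 0.1, 0.0),
    "died": (0.8, 0.2, 0.0),
    "married": (0.9, 0.0, 0.1),
    "river": (0.1, 0.9, 0.0),
    "city": (0.0, 0.9, 0.1),
    "population": (0.1, 0.8, 0.1),
    "theorem": (0.0, 0.1, 0.9),
    "proof": (0.1, 0.0, 0.9),
}

# The topic catalog: each topic is just a list of related words.
CATALOG = {
    "biography": ["born", "died", "married"],
    "place": ["river", "city", "population"],
    "math": ["theorem", "proof"],
}


def mean_vector(words):
    """Average the vectors of the known words; None if no word is known."""
    known = [VECTORS[w] for w in words if w in VECTORS]
    if not known:
        return None
    return tuple(sum(component) / len(known) for component in zip(*known))


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_topic(article_words):
    """Pick the catalog topic whose mean vector is closest to the article's."""
    article_vec = mean_vector(article_words)
    if article_vec is None:
        return None
    return max(CATALOG, key=lambda t: cosine(article_vec, mean_vector(CATALOG[t])))


print(best_topic("the city lies on a river".split()))
```

Words missing from the vector table are simply ignored, so an article with no known words gets no topic at all rather than an arbitrary one.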
Dear Amir,
Two years ago, we used Wikipedia categories to analyze the distribution of articles over a set of main topics. We used the 24 direct subcategories of "Category:Main topic classifications" as main topics. For further information, see Section 4.2 in this paper: http://www.uni-weimar.de/medien/webis/publications/papers/stein_2012d.pdf
Best regards,
Maik
-- Maik Anderka Research Group "Knowledge-Based Systems" Department of Computer Science University of Paderborn, Germany http://www.uni-paderborn.de/cs/ag-klbue
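One plausible way to assign main topics from categories is to walk up the category graph from a page until a main topic is reached. The sketch below does that breadth-first on a tiny, invented graph; it is not taken from the paper, and the category names, the PARENTS table and the max_depth parameter are assumptions made for this example. The depth limit is there because a loosely structured category graph can connect almost anything to anything if the walk goes far enough.

```python
from collections import deque

# Toy excerpt of a category graph: child -> parent categories.
# All names and links here are invented for the example.
PARENTS = {
    "Otto von Bismarck": ["German chancellors"],
    "German chancellors": ["Politicians", "History of Germany"],
    "Politicians": ["People"],
    "History of Germany": ["History"],
}

# Stand-ins for the direct subcategories of "Category:Main topic classifications".
MAIN_TOPICS = {"People", "History", "Mathematics"}


def main_topics_for(page, max_depth=5):
    """Breadth-first walk up the graph; return main topics with their distance."""
    found = {}
    seen = {page}
    queue = deque([(page, 0)])
    while queue:
        node, depth = queue.popleft()
        if node in MAIN_TOPICS:
            found.setdefault(node, depth)
            continue
        if depth == max_depth:
            continue
        for parent in PARENTS.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return found


print(main_topics_for("Otto von Bismarck"))
```

In this toy graph the page reaches two main topics at the same distance, so some tie-breaking rule (or allowing several topics per article) would still be needed.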
Hello Amir,
The question that arises for me is: what will you use the classification for? Depending on that, you can get very different answers. The biography of Otto von Bismarck may be in the category "history", the biography of Justin Bieber in "entertainment".
Kind regards
Ziko
It's not so much "topics", as I wrote in the email subject, but more like "types", as I wrote in the email body. Sorry about the confusion.
We are getting serious about analyzing how people translate articles. Basically, all articles are worth translating, but we may find, for example, that Wikipedia has 60% biographies, 30% articles about places and 10% articles about math, but of the translated articles, 80% are about places, and biographies and math are 10% each. If this turns out to be the case, we may want to understand why people don't translate articles about biographies and math more: are they simply less interesting? Is it harder for some social reason? For some technical reason? If there is something that we can do to make translation easier, we may want to do it.
This example is, of course, highly simplified and the numbers are completely made up, but I hope that it explains the intention.
Now when I say "biographies", "articles about places" and "articles about math", it's immediately clear and intuitive to a person what I'm talking about. I am asking whether there is a known easy way for software to understand such things.
-- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com “We're living in pieces, I want to live in peace.” – T. Moore
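The comparison described in this message boils down to dividing each type's share among translated articles by its share among all articles. Here is a small sketch using the made-up percentages from the message; the function and variable names are invented for the example.

```python
# Shares taken from the made-up example above (fractions of all articles
# and of translated articles); none of these are real measurements.
all_articles = {"biographies": 0.60, "places": 0.30, "math": 0.10}
translated = {"biographies": 0.10, "places": 0.80, "math": 0.10}


def representation_ratio(overall, subset):
    """Ratio of each type's share in the subset to its overall share."""
    return {kind: subset[kind] / overall[kind] for kind in overall}


ratios = representation_ratio(all_articles, translated)
# Print from most under-represented to most over-represented.
for kind, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{kind}: {ratio:.2f}")
```

A ratio below 1 means the type is translated less often than its overall share would suggest, and a ratio above 1 means more often. With these particular numbers, biographies come out well below 1, places well above, and math sits at 1 because its share is the same in both sets.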
Okay, thanks Amir.
Doesn't it mainly have to do with the structure of the article? A biography article has a very chronological outline. Math... I think it is like "here is the problem, there is the solution"? City... well, there are some common patterns. Historical events often follow: "Naming", "Prehistory", "Event", "Aftermath", "Reception" ("what people say" or "what historiographers say")...
They say that biography articles are very suitable for beginners or non-historians, because the structure is relatively simple (as Goethe said: "He lived, had a wife, and died."). Possibly they are also easier to translate...
So far some thoughts. :-) Ziko
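If article structure really is a good signal, a very crude version could compare an article's section headings with the headings typical of each type. The sketch below does only that. The "historical event" list follows the outline suggested in this message, while the biography and place lists, the OUTLINES table and the guess_type function are guesses invented for the example.

```python
# Typical section headings per article type. The "historical event" list
# follows the outline suggested above; the other two lists are guesses
# made up for this example.
OUTLINES = {
    "historical event": {"naming", "prehistory", "event", "aftermath", "reception"},
    "biography": {"early life", "career", "personal life", "death"},
    "place": {"geography", "history", "demographics", "economy"},
}


def guess_type(headings):
    """Return the type whose typical headings overlap most with the article's."""
    normalized = {h.strip().lower() for h in headings}
    scores = {kind: len(normalized & typical) for kind, typical in OUTLINES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_type(["Early life", "Career", "Death", "Legacy"]))
```

Exact heading matches are brittle, and headings differ between languages, so this would only be a starting point.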
You may want to take a look at the DBpedia ontology and its instance-type mapping.
I guess that Wikidata employs a similar classification.
HTH
G
If you're not interested in actual topic extraction, a good heuristic for identifying high-level topic areas is to rely on WikiProjects on the English Wikipedia and then use language links from Wikidata to apply them to other languages. That won't immediately cover articles that exist in only one language, but it's the most effective heuristic I can think of for your use case.
Dario
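The heuristic above can be sketched as a two-step lookup: from an article in some language to its English counterpart via the language links, then from the English article to its WikiProjects. All data below is placeholder data held in memory; the item IDs, the project tags, the table names and the projects_for function are assumptions made for illustration, and a real version would read WikiProject tags and sitelinks from dumps or an API.

```python
# WikiProject tags for English Wikipedia articles (hypothetical sample data).
EN_PROJECTS = {
    "Otto von Bismarck": ["Biography", "Germany"],
    "Pythagorean theorem": ["Mathematics"],
}

# Sitelinks per Wikidata item: language code -> article title.
# The item IDs are placeholders, not real Wikidata entries.
SITELINKS = {
    "Q_A": {"en": "Otto von Bismarck", "nl": "Otto von Bismarck"},
    "Q_B": {"en": "Pythagorean theorem", "de": "Satz des Pythagoras"},
    "Q_C": {"de": "Beispielartikel"},  # no English article
}


def projects_for(lang, title):
    """Look up the English article linked to (lang, title) and return its projects."""
    for links in SITELINKS.values():
        if links.get(lang) == title:
            english_title = links.get("en")
            return EN_PROJECTS.get(english_title, [])
    return []


print(projects_for("de", "Satz des Pythagoras"))
print(projects_for("de", "Beispielartikel"))  # no English counterpart
```

Articles without an English counterpart come back with an empty list, which is the coverage gap mentioned in the message.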