Hi, what's the simplest way to extract the complete and latest Wikipedia category hierarchy?
Thanks, Pavan
On Wed, May 29, 2013 at 4:33 PM, Pavan Kapanipathi pavan@knoesis.org wrote:
Hi, what's the simplest way to extract the complete and latest Wikipedia category hierarchy?
Note that the "category hierarchy" isn't much of a hierarchy; it's a directed graph with cycles and no particular root.
Your best bet is probably to download a database dump from http://dumps.wikimedia.org/enwiki/latest/ and process it.
-- Brad Jorsch, Software Engineer, Wikimedia Foundation
Hi Brad, is there actually a schema of the graph? You said "directed", so there should be some hierarchy in it too. I read that there are 50K+ categories; is there perhaps a list of the "directions", i.e. the "links" showing how the categories are connected?
Mediawiki-api mailing list Mediawiki-api@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
actually is there a schema of the graph? you said "directed", hence there should be some hierarchy in it too.
I'm not sure I understand what you're asking, but the hierarchy is based on the relation between a category and its subcategories. In more technical terms, in the directed graph of categories, edges are from a category to its subcategories.
I read there are 50K+ categories, maybe is there any list of "directions" aka "links" of how categories are connected?
Actually, there seem to be 1M+ categories (though not all of them are normal article categories). And category-subcategory links are exactly what the categorylinks dump contains, along with category-page links.
Petr Onderka [[en:User:Svick]]
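The "edges from a category to its subcategories" view can be sketched in a few lines of Python. This is only an illustration: it assumes the links have already been reduced to (subcategory, parent) title pairs, the sample pairs are category names mentioned in this thread, and the function name `build_subcategory_graph` is made up for the example.

```python
from collections import defaultdict

# Illustrative (subcategory, parent) pairs; real data would come from the
# categorylinks dump after resolving page IDs to titles.
links = [
    ("World_War_II", "Conflicts_in_1939"),
    ("World_War_II", "Modern_Europe"),
    ("Futurama", "Comic_science_fiction"),
    ("Military_deception_during_World_War_II", "World_War_II"),
]


def build_subcategory_graph(pairs):
    """Return a dict mapping each parent category to its sorted subcategories."""
    children = defaultdict(set)
    for child, parent in pairs:
        # The edge points from the parent to the subcategory.
        children[parent].add(child)
    return {parent: sorted(kids) for parent, kids in children.items()}


graph = build_subcategory_graph(links)
print(graph["World_War_II"])
```

Because a category can have several parents, the same title may show up under more than one key, which is one reason the result is a graph rather than a tree.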
Oh I see! (The 50K+ figure was based on an old paper I read... 1M+, hehe, the wiki is growing :) )
Thank you for pointing me to the tables!
Thank you Mark,
I was indeed looking for the tables to construct that structure. Can you also confirm that categorylinks and categorypages are enough?
-- Luigi Assom
Skype contact: oggigigi
The simplest way is of course to use DBpedia for that and create a SPARQL query that will give you all the skos:broader terms, e.g. http://dbpedia.org/page/Category:Italian_Roman_Catholics
Yury Katkov, WikiVote
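A query of that kind could be put together roughly as follows. This is a sketch under assumptions, not a tested recipe: the endpoint URL, the `format` parameter and the resource URI pattern are assumptions about the public DBpedia service, and `broader_query` and `request_url` are hypothetical helper names. The code only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# Assumed address of the public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"


def broader_query(category):
    """Build a SPARQL query for the skos:broader terms of one category."""
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT ?parent WHERE {\n"
        f"  <http://dbpedia.org/resource/Category:{category}> skos:broader ?parent .\n"
        "}"
    )


def request_url(category):
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {
        "query": broader_query(category),
        "format": "application/sparql-results+json",  # assumed parameter
    }
    return ENDPOINT + "?" + urlencode(params)


print(broader_query("Italian_Roman_Catholics"))
```

Fetching `request_url("Italian_Roman_Catholics")` with any HTTP client should then return the parent categories that DBpedia knows about, which may lag behind or differ from the live wiki.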
Thank you Yury, I am looking for something like that, but covering all categories, for analysis. I think I will use the wiki database dump.
Hi Yury, I already tried this and there are quite a few links missing from some categories to their parent categories, so I opted not to use DBpedia.
Thanks, Pavan
Hi again guys! I am still digging into the wiki categories. As Brad said, there is no real hierarchy. I basically extracted the categories from categorylinks.sql.
I have questions about the results I obtained, which are in the form #id_source_category title_target_category. The problem I have is that the directed graph is actually "mixed": sometimes a real sub-category points to a parent, sometimes vice versa, so I can't tell which category is the more important one.
- Is there (maybe in another dump) a mark or parameter (something like "ns", maybe?) telling you which category is the head with respect to another? As an example, take "http://en.wikipedia.org/wiki/Category:World_War_II": it says "this is root category....".
- The other question is about the results: http://en.wikipedia.org/wiki/Category:World_War_II lists sub-categories which I have not found in categorylinks.sql, e.g. for WWII I obtain the conflicts but not the other ones.
Am I missing something?
I have questions about the results I obtained, which are in the form #id_source_category title_target_category. The problem I have is that the directed graph is actually "mixed": sometimes a real sub-category points to a parent, sometimes vice versa, so I can't tell which category is the more important one.
That's not true. Each link in categorylinks is from a page (cl_from, this can be a category page) to its parent category (cl_to).
- Is there (maybe in another dump) a mark or parameter (something like "ns", maybe?) telling you which category is the head with respect to another? As an example, take "http://en.wikipedia.org/wiki/Category:World_War_II": it says "this is root category....".
That text just describes what belongs to this category. It has nothing to do with the structure, which is always the same.
- The other question is about the results: http://en.wikipedia.org/wiki/Category:World_War_II lists sub-categories which I have not found in categorylinks.sql, e.g. for WWII I obtain the conflicts but not the other ones.
If I look at the categorylinks for that category (cl_from = 690451) in the latest dump, I get some “conflicts” categories, then some “Wars involving X” categories and also a few others, like “Modern Europe”. I don't see any parent category that would be missing.
Maybe it would help if you described the issue in more detail: what results exactly are you getting, what results are you expecting and how do the two differ.
Petr Onderka [[en:User:Svick]]
Hello Petr,
Thank you.
I want to obtain this result: a list of categories such as: parent category -> subcategory
That's not true. Each link in categorylinks is from a page (cl_from, this can be a category page) to its parent category (cl_to).
Indeed, I followed this structure, selecting only category pages, and I obtained e.g.
690070 (sub-category "Futurama", cl_from) to parent category "Comic_science_fiction"
690451 (sub-category "World War II", cl_from) to parent category "Conflicts_in_1939"
So my perception is that the first example is correct, whereas World War II should be the parent of Conflicts_in_1939.
Am I right, or am I missing or misunderstanding something?
If I look at the categorylinks for that category (cl_from = 690451) in the latest dump, I get some “conflicts” categories, then some “Wars involving X” categories and also a few others, like “Modern Europe”. I don't see any parent category that would be missing.
That's correct for Modern Europe, but if you compare with the page http://en.wikipedia.org/wiki/Category:World_War_II you see that some links are missing in the SQL dump, such as: http://en.wikipedia.org/wiki/Category:Military_deception_during_World_War_II or http://en.wikipedia.org/wiki/Category:Sociology_of_World_War_II
This is the list I obtain; the above sub-categories are not present:
690451 1930s_conflicts
690451 1940s_conflicts
690451 20th-century_conflicts
690451 Categories_requiring_diffusion
690451 Conflicts_in_1939
690451 Conflicts_in_1940
690451 Conflicts_in_1941
690451 Conflicts_in_1942
690451 Conflicts_in_1943
690451 Conflicts_in_1944
690451 Conflicts_in_1945
690451 Global_conflicts
690451 Modern_Europe
690451 Wars_involving_Albania
690451 Wars_involving_Argentina
690451 Wars_involving_Australia
690451 Wars_involving_Austria
690451 Wars_involving_ .... [etc.]
Indeed, I followed this structure, selecting only category pages, and I obtained e.g.
690070 (sub-category "Futurama", cl_from) to parent category "Comic_science_fiction"
690451 (sub-category "World War II", cl_from) to parent category "Conflicts_in_1939"
So my perception is that the first example is correct, whereas World War II should be the parent of Conflicts_in_1939.
Am I right, or am I missing or misunderstanding something?
I don't understand why the two should be different. Futurama is a comic science fiction, so Category:Futurama is a subcategory of Category:Comic science fiction. World War II is a conflict that was in progress in 1939, so Category:World War II is a subcategory of Category:Conflicts in 1939. This seems consistent to me.
If I look at the categorylinks for that category (cl_from = 690451) in the latest dump, I get some “conflicts” categories, then some “Wars involving X” categories and also a few others, like “Modern Europe”. I don't see any parent category that would be missing.
That's correct for Modern Europe, but if you compare with the page http://en.wikipedia.org/wiki/Category:World_War_II you see that some links are missing in the SQL dump, such as: http://en.wikipedia.org/wiki/Category:Military_deception_during_World_War_II or http://en.wikipedia.org/wiki/Category:Sociology_of_World_War_II
What you got was a list of parent categories of Category:World War II. But Category:Military deception during World War II is a *subcategory* of Category:World War II, not a parent category.
If you want to get a list of subcategories, you will need to search for something like page_namespace = 14 AND cl_to = 'World_War_II'.
Petr Onderka [[en:User:Svick]]
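The difference between the two lookups can be tried out on a toy database. The sketch below uses an in-memory SQLite database with only the columns named in this thread; apart from 690451, the page IDs are made up, the article row is invented, and the helper names `parents_of` and `subcategories_of` are hypothetical. The real dump tables have more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
""")
conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
    (690451, 14, "World_War_II"),
    (900001, 14, "Military_deception_during_World_War_II"),  # made-up ID
    (900002, 14, "Sociology_of_World_War_II"),               # made-up ID
    (900003, 0, "Some_article"),                             # invented article (namespace 0)
])
conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", [
    (690451, "Conflicts_in_1939"),  # World_War_II -> parent
    (690451, "Modern_Europe"),      # World_War_II -> parent
    (900001, "World_War_II"),       # subcategory -> World_War_II
    (900002, "World_War_II"),       # subcategory -> World_War_II
    (900003, "World_War_II"),       # article -> World_War_II
])


def parents_of(page_id):
    """Parent categories of a page: rows where cl_from is the page's ID."""
    rows = conn.execute(
        "SELECT cl_to FROM categorylinks WHERE cl_from = ? ORDER BY cl_to",
        (page_id,))
    return [row[0] for row in rows]


def subcategories_of(title):
    """Subcategories of a category: category pages whose cl_to is the title."""
    rows = conn.execute(
        "SELECT page_title FROM categorylinks "
        "JOIN page ON page_id = cl_from "
        "WHERE page_namespace = 14 AND cl_to = ? ORDER BY page_title",
        (title,))
    return [row[0] for row in rows]


print(parents_of(690451))
print(subcategories_of("World_War_II"))
```

Filtering on cl_from gives the parents, which is the list shown earlier in the thread; filtering on cl_to and joining with page restricted to namespace 14 gives the subcategories and leaves out ordinary articles.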
OK Petr, thank you for the clarification. I will also look at page_namespace = 14 and cl_to.
P.S. One more question: in enwiki-20130503-pages-articles-multistream-index there are some entries with titles that have no spaces, such as: 549:10:AccessibleComputing
Why are they not reported as Accessible_Computing, which would be consistent with the other dump files (e.g. enwiki-20130503-all-titles-in-ns0)? Is there a special reason for stripping the blank space?
On Thu, Jun 6, 2013 at 1:20 PM, Luigi Assom luigi.assom@gmail.com wrote:
OK Petr, thank you for the clarification. I will also look at page_namespace = 14 and cl_to.
P.S. One more question: in enwiki-20130503-pages-articles-multistream-index there are some entries with titles that have no spaces, such as: 549:10:AccessibleComputing
Why are they not reported as Accessible_Computing, which would be consistent with the other dump files (e.g. enwiki-20130503-all-titles-in-ns0)? Is there a special reason for stripping the blank space?
The page "AccessibleComputing" (https://en.wikipedia.org/w/index.php?title=AccessibleComputing&redirect=...) is a redirect to the article "Computer accessibility" (https://en.wikipedia.org/wiki/AccessibleComputing). The very oldest version of the software that runs Wikipedia didn't support names with spaces (the "AccessibleComputing" article dates from January 2001), so there are some redirects with names like this.
The redirects "Accessible Computing" (https://en.wikipedia.org/w/index.php?title=Accessible_Computing&redirect...) and "Accessible computing" (https://en.wikipedia.org/w/index.php?title=Accessible_computing&redirect...) also exist, and are distinct pages with their own histories.
-- Mark Wagner
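For reading entries like the one quoted above, a minimal Python sketch follows. It assumes each index line has the form offset:page ID:title, which matches the sample entry; `parse_index_line` is a hypothetical helper name, and the second sample line uses made-up numbers.

```python
def parse_index_line(line):
    """Split one index line into (offset, page_id, title).

    Only the first two colons are separators, so titles that contain a
    colon themselves (such as "Category:...") stay intact.
    """
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


print(parse_index_line("549:10:AccessibleComputing"))
# Made-up offset and page ID, to show a title containing a colon.
print(parse_index_line("12345:67890:Category:World_War_II"))

# The three redirect titles are different strings, so they stay separate keys.
titles = {"AccessibleComputing", "Accessible Computing", "Accessible computing"}
print(len(titles))
```

Keeping the title exactly as it appears, without inserting or removing spaces, avoids merging pages that the wiki treats as distinct.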
oh thank you!! very useful indeed!
I will treat them as distinct cases then, thank you Petr!
On 30/05/13 11:05, Luigi Assom wrote:
Thank you Mark,
I was indeed looking for the tables to construct that structure. Can you also confirm that categorylinks and categorypages are enough?
You only need the tables categorylinks and page (the page entries with page_namespace = 14, a.k.a. NS_CATEGORY).
"Directed cyclic graph" is a mathematical term describing the nature of the Wikipedia category system. In non-mathematical terms, "graph" means it's a collection of objects (categories) with links ("category A is a subcategory of category B") between them, "cyclic" means that there are places where you can go from "A" to "B" to "C" to "A" by following those links (a category can be a subcategory of itself), and "directed" means those links are not symmetrical (that is, if category "A" is a subcategory of "B", it doesn't mean that "B" is a subcategory of "A").
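To make the "cyclic" part concrete, here is a small sketch of a depth-first search that reports one cycle in a directed graph. The graph representation (a dict from a node to the nodes it links to) and the name `find_cycle` are choices made for this illustration, and the A, B, C data mirrors the example in the paragraph above rather than real categories.

```python
def find_cycle(links):
    """Return one cycle as a list of nodes (first node repeated at the end),
    or None if the directed graph has no cycle."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for nxt in links.get(node, ()):
            status = state.get(nxt, UNSEEN)
            if status == IN_PROGRESS:
                # nxt is already on the current path: the path loops back.
                return path[path.index(nxt):] + [nxt]
            if status == UNSEEN:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in list(links):
        if state.get(node, UNSEEN) == UNSEEN:
            found = visit(node)
            if found:
                return found
    return None


print(find_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))
print(find_cycle({"A": ["B"], "B": ["C"], "C": []}))
```

The recursive form is fine for small examples; for a graph with 1M+ categories an iterative version with an explicit stack would be safer, since Python limits recursion depth.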