Hi Maarten,
On Sun, 15 Dec 2013 12:30:24 +0100
Maarten Dammers <maarten(a)mdammers.nl> wrote:
Hi everyone,
I've been playing around with the structure of the category graph for
quite some time for https://commons.wikimedia.org/wiki/User:CategorizationBot.
This bot takes an uncategorized image, looks where it is used and
tries to find relevant categories at Commons. It applies some filters
and one of them is the filter against over-categorization
(https://commons.wikimedia.org/wiki/Commons:OVERCAT). For this I create
a simple child->parent table that used to live on the Toolserver, but
now appears to have vanished.
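
To make the over-categorization filter concrete, here is a minimal Python sketch, assuming the child->parent table has been loaded into a plain dict. The function names and the example categories are hypothetical and are not taken from CategorizationBot itself. A candidate category is dropped when it is already an ancestor of another candidate.

```python
from collections import deque


def ancestors(category, parents):
    """Return every category reachable by following child->parent links."""
    seen = set()
    queue = deque(parents.get(category, ()))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # the category graph can contain cycles
        seen.add(current)
        queue.extend(parents.get(current, ()))
    return seen


def drop_overcategorization(candidates, parents):
    """Keep only candidates that are not an ancestor of another candidate."""
    redundant = set()
    for category in candidates:
        # Exclude the category itself in case a cycle leads back to it.
        redundant |= ancestors(category, parents) - {category}
    return [c for c in candidates if c not in redundant]


# Hypothetical child->parent table (illustrative data only).
parents = {
    "Churches in Haarlem": ["Churches in North Holland", "Buildings in Haarlem"],
    "Churches in North Holland": ["Churches in the Netherlands"],
    "Buildings in Haarlem": ["Haarlem"],
}
candidates = ["Churches in Haarlem", "Churches in the Netherlands", "Haarlem"]
filtered = drop_overcategorization(candidates, parents)
```

With this toy data only the most specific category survives, because the other two candidates are both reachable from it.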
I would love to have some sort of dump or (even better) a central
service I can query. It should contain for all Wikimedia projects:
* Page links (page A links to page B)
This should be in the pagelinks table in the database replica.
* Category links (page A is in category C)
If you need only direct relations, this is in the categorylinks table.
For recursive/transitive relations (page A is in a category that is in
category C... and so on), that's what we currently have in Catgraph for
a selection of wikis.
http://sylvester.wmflabs.org:8090/list-graphs
You can query Catgraph from your bot on Labs, or even from somewhere
else (but you need access to the db replica as well - Catgraph only
stores page_ids).
We currently have the category links, but we could include other data.
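
For the direct relations, a child->parent table can be derived from categorylinks by joining it against page. The sketch below is only an illustration under stated assumptions: it uses an in-memory SQLite database with toy tables that mimic a few columns of the MediaWiki schema as I remember it (page_id, page_namespace, page_title, cl_from, cl_to), so please check the column names against the actual replica before relying on it. The function name and the sample rows are made up.

```python
import sqlite3

CATEGORY_NAMESPACE = 14  # MediaWiki namespace number for "Category:"

# Toy stand-ins for the replica tables (assumed columns, illustrative rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (
        page_id INTEGER PRIMARY KEY,
        page_namespace INTEGER,
        page_title TEXT
    );
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
""")
conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
    (1, 14, "Churches_in_Haarlem"),
    (2, 14, "Buildings_in_Haarlem"),
    (3, 6, "Example.jpg"),  # a file page, not a category
])
conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", [
    (1, "Buildings_in_Haarlem"),
    (2, "Haarlem"),
    (3, "Churches_in_Haarlem"),
])


def child_parent_rows(connection):
    """Return (child category, parent category) pairs, direct links only."""
    query = """
        SELECT page.page_title, categorylinks.cl_to
        FROM categorylinks
        JOIN page ON page.page_id = categorylinks.cl_from
        WHERE page.page_namespace = ?
        ORDER BY page.page_title, categorylinks.cl_to
    """
    return connection.execute(query, (CATEGORY_NAMESPACE,)).fetchall()


rows = child_parent_rows(conn)
```

The namespace filter keeps only category-to-category links, so the file page is left out of the result. Anything transitive still needs a graph walk on top of these rows.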
* Image links (page A uses image I)
* Interlanguage links (page A in language en links to page A' in
language nl)
* Interproject links (page A in the English Wikipedia links to page A'
on Wikimedia Commons)
These should be in the database replica, right? See the imagelinks,
langlinks and iwlinks tables.
And to make it really complete:
* Wikidata claims (item A has a claim pointing to item B)
I'm currently working on importing transitive properties, like "is in
the administrative unit", to Catgraph. This depends on the database
dumps in the new plain JSON format, which should become available Any
Day Now (TM).