Hi,
On Sun, 22 Dec 2013 13:42:25 +0100
Maarten Dammers <maarten(a)mdammers.nl> wrote:
[...]
Use cases (all at Wikimedia Commons):
* Make a report of loops. I want a report for each length: 0 being self
categorized, 1 being A->B->A, 2 being A->B->C->A etc.
* Give a number of categories and filter out overcategorization
(Category:Berlin + Category:Germany -> Category:Berlin)
Can this be done with catgraph?
Some progress on this: I have imported the Commons categories. To find
cycles, try this:
http://tools.wmflabs.org/render-tests/catcycle-dev/catcycle.py?action=find-…
This uses CommonsRoot, which is supposed to be "the top level node in
Commons tree data structure from which every other category is
accessible" as a start node, and looks for cycles with (practically)
unlimited depth. Judging by the number of nodes visited, about 2500
categories are not reachable from CommonsRoot (total Commons
categories: 3163665, nodes visited: 3166147). So the query should find
most of the cycles, but possibly not every last one of them. In any
case, it's a good starting point for looking for cycles. The page takes
about half a minute to load.
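
As an illustration of the kind of search described above, here is a
minimal sketch of a depth-first search that starts at one node and
records a cycle whenever it meets a category that is still on the
current path. It is not CatCycle's or CatGraph's actual code. The
`children` dict (category name -> list of subcategories) and the
function names are assumptions made for this example. The lengths
follow the convention from the question: 0 is a self-categorized
category, 1 is A->B->A, and so on. Note that this reports one cycle per
back edge and does not enumerate every possible cycle.

```python
from collections import defaultdict

WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished


def find_cycles(children, root):
    """Return cycles found via back edges in an iterative DFS from root.

    children: dict mapping a category to a list of its subcategories
              (hypothetical input format, not CatGraph's).
    Each cycle is a list [A, B, ..., Z] meaning A->B->...->Z->A.
    """
    colour = defaultdict(int)      # defaults to WHITE
    path = [root]                  # nodes on the current DFS path
    pos = {root: 0}                # node -> index in path
    cycles = []
    colour[root] = GREY
    stack = [(root, iter(children.get(root, ())))]
    while stack:
        node, it = stack[-1]
        for child in it:
            if colour[child] == GREY:
                # child is on the current path: back edge closes a cycle
                cycles.append(path[pos[child]:])
            elif colour[child] == WHITE:
                colour[child] = GREY
                pos[child] = len(path)
                path.append(child)
                stack.append((child, iter(children.get(child, ()))))
                break
        else:
            # all subcategories handled: leave this node
            colour[node] = BLACK
            path.pop()
            del pos[node]
            stack.pop()
    return cycles


def report_by_length(cycles):
    """Group cycles by length: 0 = self-categorized, 1 = A->B->A, ..."""
    report = defaultdict(list)
    for cycle in cycles:
        report[len(cycle) - 1].append(cycle)
    return dict(report)


example = {
    "Root": ["A", "S"],
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["Root"],
    "S": ["S"],
}
example_report = report_by_length(find_cycles(example, "Root"))
```

An iterative search is used here because a recursive one would hit
Python's recursion limit long before "practically unlimited depth" on a
graph with millions of nodes.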
One way to find some unreachable categories is the "Find root nodes"
function in CatCycle. Root nodes are categories without any parent
category. As I understand it, there shouldn't be any root nodes except
CommonsRoot by definition.
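
The idea of root nodes can be sketched as follows. This is only an
illustration, not the "Find root nodes" implementation in CatCycle; the
`children` dict and both function names are assumptions for this
example. It also shows why root nodes reveal only some of the
unreachable categories: a group of categories that only categorize each
other has no root node, yet cannot be reached from CommonsRoot.

```python
def all_nodes(children):
    """Collect every category that appears as a parent or a subcategory."""
    nodes = set(children)
    for subs in children.values():
        nodes.update(subs)
    return nodes


def find_root_nodes(children):
    """Return categories that are not a subcategory of anything."""
    has_parent = set()
    for subs in children.values():
        has_parent.update(subs)
    return all_nodes(children) - has_parent


def unreachable_from(children, root):
    """Return categories that cannot be reached from root."""
    seen = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for sub in children.get(node, ()):
            if sub not in seen:
                seen.add(sub)
                stack.append(sub)
    return all_nodes(children) - seen


# Hypothetical toy graph: "Orphan" has no parent, "P" and "Q" only
# categorize each other.
toy = {
    "CommonsRoot": ["A"],
    "A": ["B"],
    "Orphan": ["X"],
    "P": ["Q"],
    "Q": ["P"],
}
roots = find_root_nodes(toy)
unreachable = unreachable_from(toy, "CommonsRoot")
```

In the toy graph, "Orphan" shows up as an extra root node, while "P"
and "Q" are unreachable without being root nodes.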
I've set the refresh interval of the Commons graph to 4 hours, so this
is the maximum age of the data you get (plus the replication lag of the
Labs replica). I can shorten that if needed.
Does that help so far?
For intersections of categories of actual files, it's different.
Currently we have one VM where we keep all graphs we imported. Commons
is rather large, and the whole graph including leaves (=files) won't
fit into the RAM of this host. This means I will have to distribute the
graphs into several VMs first. This was planned anyway, I just haven't
done it yet.
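
If the overcategorization use case quoted above (Category:Berlin +
Category:Germany -> Category:Berlin) is read as needing only the
category graph and not the files, it could be sketched like this. This
is an assumption about what is wanted, not an existing CatGraph
feature; the `parents` dict (category -> list of parent categories) and
the function names are made up for this example.

```python
def ancestors(parents, category):
    """Return all direct and indirect parent categories of category."""
    seen = set()
    stack = [category]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def filter_overcategorization(parents, categories):
    """Drop every category that is an ancestor of another given one."""
    redundant = set()
    for category in categories:
        # Exclude the category itself so that a cycle through it does
        # not make it redundant on its own.
        redundant |= ancestors(parents, category) - {category}
    return [c for c in categories if c not in redundant]


# Hypothetical toy data, not taken from Commons.
toy_parents = {
    "Berlin": ["Cities in Germany"],
    "Cities in Germany": ["Germany"],
    "Germany": ["Europe"],
}
filtered = filter_overcategorization(toy_parents, ["Berlin", "Germany"])
```

One caveat: if two of the given categories sit on the same cycle, each
is an ancestor of the other and both are dropped, so cycles would need
to be cleaned up or handled separately first.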
Cheers,
Johannes
--
Johannes Kroll
Software Developer
Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Tel. (030) 219 158 26-0
http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as a non-profit
by the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.