On 10 Dec 2013, at 9:46 AM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:
This request seems like it could be easy to fulfill.
Am I understanding correctly that the dataset being sought would simply contain a list of
pairs of pages (in the case of internal links) and a list of page/category pairs (in the
case of categorization)?
Yes, but one thing that must be done is to normalize the id space. Presently categories and
pages have overlapping id spaces. They're also non-contiguous, which is a pain when
running, say, any ranking algorithm.
We can simply dump out the categorylinks and
pagelinks tables to meet these needs. I understand that the SQL dump format is
painful to deal with, but a plain CSV/TSV format should be fine. The mysql client will
create a solid TSV format if you just pipe a query to the client and stream the output to a
file (or, better yet, through bzip2 and then to a file).
We would need to 1) decide on an id-space organization, 2) dump the data, translating it
into the new id space, and 3) build the graph.
For 1) I think it would be good to have two spaces, [0..x) and [x..y), one for categories
and one for pages. For 2) it's just a matter of a Java class fiddling with the ids and
the titles (the format for links is asymmetrical). For 3) I'd love to release a binary
compressed version, because it takes much less space, it is immediately usable, and if you
want to dump the pairs <x,y> in ASCII it's just a single command line.
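The remapping in steps 1) and 2) can be sketched roughly as follows. This is a minimal
Python sketch (the real tool would be the Java class mentioned above); the raw ids are
hypothetical examples, and it assumes pages land in [0..P) and categories in [P..P+C):

```python
# Sketch: map raw, overlapping, non-contiguous MediaWiki ids into one
# contiguous id space: pages -> [0, P), categories -> [P, P + C).

def build_id_map(page_ids, category_ids):
    """Return (page_map, cat_map): dicts from raw ids to dense ids.

    Pages fill [0, len(page_ids)); categories continue from there,
    so the two spaces never collide even if the raw ids overlap."""
    page_map = {raw: i for i, raw in enumerate(sorted(page_ids))}
    offset = len(page_map)
    cat_map = {raw: offset + i for i, raw in enumerate(sorted(category_ids))}
    return page_map, cat_map

# Raw id spaces overlap: here 12 is both a page id and a category id.
page_map, cat_map = build_id_map([12, 57, 3049], [12, 830])

# Translate a raw categorylinks pair (page 57 -> category 12)
# into an edge in the normalized graph.
edge = (page_map[57], cat_map[12])
```

With the ids made dense and disjoint, any edge list dumped this way can be fed
straight into a graph library or a ranking algorithm without further translation.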
These two tables track the most recent
categorization/link structure of pages, so we wouldn't be able to use them
historically.
Someone previously asked for temporal data. How can we get access to that? We might
provide a label file with on-off dates for every, say, category link.
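A minimal sketch of how such on-off intervals might be assembled, assuming we could
extract a date-sorted stream of add/remove events per category link from the revision
history (the event tuples below are hypothetical):

```python
from collections import defaultdict

def link_intervals(events):
    """events: iterable of (link, date, 'on'|'off'), sorted by date.

    Returns link -> list of (on_date, off_date) intervals;
    off_date is None if the link is still present."""
    intervals = defaultdict(list)
    open_at = {}  # links currently "on", with the date they appeared
    for link, date, kind in events:
        if kind == "on":
            open_at[link] = date
        elif link in open_at:
            intervals[link].append((open_at.pop(link), date))
    # Links never removed stay open-ended.
    for link, start in open_at.items():
        intervals[link].append((start, None))
    return dict(intervals)

# Hypothetical events for two category links.
events = [
    ("A -> Category:X", "2004-01-01", "on"),
    ("B -> Category:Y", "2005-03-02", "on"),
    ("A -> Category:X", "2006-05-10", "off"),
]
labels = link_intervals(events)
```

Each line of the label file would then be a link plus its interval list, which is
enough to reconstruct the category graph as of any given date.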
Ciao,
seba