On Tue, Dec 10, 2013 at 10:46 AM, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:
This request seems like it could be easy to fulfill.  Am I understanding correctly that the dataset being sought would simply contain a list of pairs of pages (in the case of internal links) and a list of page/category pairs (in the case of categorization)?

We can simply dump out the categorylinks and pagelinks tables to meet these needs.  I understand that the SQL dump format is painful to deal with, but a plain CSV/TSV format should be fine.  The mysql client will create a solid TSV format if you just pipe a query to the client and stream the output to a file (or better yet, through bzip2 and then to a file).
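A minimal sketch of what that could look like (the host/database names follow the usual Labs replica conventions but are assumptions here, as are the exact column lists):

    # Stream the link graph out of the replica as bzip2-compressed TSV.
    # The mysql client prints tab-separated rows when its output is not a terminal;
    # --quick streams rows instead of buffering the whole result set.
    echo "SELECT pl_from, pl_namespace, pl_title FROM pagelinks;" \
      | mysql -h enwiki.labsdb enwiki_p --quick \
      | bzip2 > pagelinks.tsv.bz2

    echo "SELECT cl_from, cl_to FROM categorylinks;" \
      | mysql -h enwiki.labsdb enwiki_p --quick \
      | bzip2 > categorylinks.tsv.bz2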

Agreed, dumping the data seems straightforward.  You could write a script that runs against the Labs DB instances and update the graph definition daily, for example.
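For instance, a daily refresh could just be a cron job that reruns the mysql-client pipeline above (rather than mysqldump, since the goal is plain TSV) and stamps each run with its date; the script name and paths below are hypothetical:

    # crontab entry: rebuild the graph dump every day at 03:00
    0 3 * * * /data/project/graphdump/dump_links.sh

    # dump_links.sh: same query pipeline as above, with the run date in the filename
    #!/bin/bash
    set -e
    d=$(date +%Y%m%d)
    echo "SELECT pl_from, pl_namespace, pl_title FROM pagelinks;" \
      | mysql -h enwiki.labsdb enwiki_p --quick \
      | bzip2 > "pagelinks-$d.tsv.bz2"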
 
These two tables track the most recent categorization/link structure of pages, so we wouldn't be able to use them historically.  

I think as soon as we dump out the graph, we are adding a time dimension to it: the snapshot reflects the link structure at the moment it was taken, and any analysis using the graph would have to take this into account.  Analyses that need to look at things over time would then run into the problem of accessing historical versions of the graph.  So it seems like the smart thing to do for now is to just ignore that until it comes up :)