On Tue, Dec 10, 2013 at 10:46 AM, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:
This request seems like it could be easy to fulfill. Am I understanding
correctly that the dataset being sought would simply contain a list of
pairs of pages (in the case of internal links) and a list of page/category
pairs (in the case of categorization)?
We can simply dump out the categorylinks and pagelinks tables in
order to meet these needs. I understand that the SQL dump format is
painful to deal with, but a plain CSV/TSV format should be fine. The mysql
client will produce a solid TSV format if you just pipe a query to the
client and stream that out to a file (or better yet, bzip and then a file).
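A minimal sketch of that pipe, assuming the English Wikipedia replica
(the host and database names here are placeholders; the pl_* columns are
from the standard MediaWiki pagelinks schema):

    # When its output is piped rather than interactive, the mysql client
    # switches to batch mode and emits tab-separated rows, with the first
    # row holding the column headers.
    mysql -h enwiki.labsdb enwiki_p \
        -e "SELECT pl_from, pl_namespace, pl_title FROM pagelinks" \
      | bzip2 > pagelinks.tsv.bz2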
Agreed, dumping the data seems straightforward. You could write a script
using mysqldump on the Labs DB instances; it could regenerate the graph
definition daily, for example (see the sketch below).
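A hypothetical wrapper along those lines (the script name, host, and
database are stand-ins, not an existing Labs tool, and credentials are
assumed to come from the usual replica .my.cnf):

    #!/bin/bash
    # dump_graph.sh -- hypothetical sketch, not an existing Labs script.
    # Writes dated, bzip2-compressed TSV snapshots of both link tables.
    set -e
    DATE=$(date +%Y%m%d)    # tag each snapshot with its dump date
    for TABLE in pagelinks categorylinks; do
      mysql -h enwiki.labsdb enwiki_p \
          -e "SELECT * FROM $TABLE" \
        | bzip2 > "${TABLE}-${DATE}.tsv.bz2"
    done

Run it from cron (e.g. "0 3 * * * dump_graph.sh") and you get a fresh
graph definition every day; keeping the dated files around also gives a
crude answer to the history question below, since each day's snapshot
survives.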
These two tables track the most recent categorization/link structure of
pages, so we wouldn't be able to use them historically.
I think as soon as we dump out the graph, we attach a time dimension to
it: each dump is a snapshot of the link structure at a particular moment.
Any analysis using the graph would have to take this into account.
Analyses that need to look at things over time would then run into the
problem of accessing historical versions of the graph. So it seems like
the smart thing to do for now is just ignore that until it comes up :)