I'm not sure what you are referring to when you say "id space". A page can
be identified by its page_id or by the pair (page_namespace, page_title). A
category can be identified by its title. You could enumerate them however
you like post hoc.
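
For illustration, here is a minimal sketch of one such enumeration. It
assumes the category titles and the (page_namespace, page_title) pairs have
already been read into memory. The function names and the layout (categories
in [0, x), pages in [x, y)) are my own assumptions, not an agreed format.

```python
def build_id_space(category_titles, page_keys):
    """Assign contiguous ids: categories get [0, x), pages get [x, y).

    page_keys are (page_namespace, page_title) tuples. Sorting makes the
    assignment deterministic across runs.
    """
    cat_ids = {t: i for i, t in enumerate(sorted(set(category_titles)))}
    x = len(cat_ids)
    page_ids = {k: x + i for i, k in enumerate(sorted(set(page_keys)))}
    return cat_ids, page_ids


def remap_categorylinks(rows, cat_ids, page_ids):
    """Translate (page_key, category_title) rows into (page_id, cat_id) pairs.

    Rows that mention an unknown page or category are skipped.
    """
    for page_key, cat_title in rows:
        if page_key in page_ids and cat_title in cat_ids:
            yield page_ids[page_key], cat_ids[cat_title]


if __name__ == "__main__":
    cats = ["Physics", "Algebra"]
    pages = [(0, "Isaac_Newton"), (0, "Group_theory")]
    cat_ids, page_ids = build_id_space(cats, pages)
    links = [((0, "Isaac_Newton"), "Physics"), ((0, "Group_theory"), "Algebra")]
    for pair in remap_categorylinks(links, cat_ids, page_ids):
        print(*pair, sep="\t")
```

The two ranges never overlap and have no holes, so the resulting pairs can
be fed to a ranking algorithm without further renumbering.
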
> Someone previously asked for temporal data. How can we
> get access to that?
It doesn't exist. We could start recording a history now, but without a
clear use-case, I'm not sure it's worth the time.
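
If we did start recording, one simple approach (an assumption on my part,
nothing like this exists today) would be to diff successive snapshots of
categorylinks and log on/off events. A rough sketch with hypothetical names:

```python
def diff_snapshots(previous, current, date):
    """Compare two snapshots of (page, category) pairs.

    Returns (page, category, event, date) rows, where event is "on" for a
    link that appeared since the previous snapshot and "off" for one that
    disappeared.
    """
    prev, curr = set(previous), set(current)
    events = [(p, c, "on", date) for p, c in sorted(curr - prev)]
    events += [(p, c, "off", date) for p, c in sorted(prev - curr)]
    return events


if __name__ == "__main__":
    yesterday = [(1, "A"), (2, "B")]
    today = [(1, "A"), (3, "C")]
    for row in diff_snapshots(yesterday, today, "2013-12-11"):
        print(*row, sep="\t")
```

This only captures changes between snapshot times, so anything added and
removed between two snapshots would be missed.
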
-Aaron
On Tue, Dec 10, 2013 at 5:38 PM, Sebastiano Vigna <vigna(a)di.unimi.it> wrote:
> On 10 Dec 2013, at 9:46 AM, Aaron Halfaker <ahalfaker(a)wikimedia.org>
> wrote:
>
> > This request seems like it could be easy to fulfill. Am I understanding
> > correctly that the dataset being sought would simply contain a list of
> > pairs of pages (in the case of internal links) and a list of
> > page/category pairs (in the case of categorization)?
>
> Yes, but one thing that must be done is to normalize the id space.
> Presently categories and pages have overlapping id spaces. They're also
> non-contiguous, which is a pain for running, say, any ranking algorithm.
>
> > We can simply dump out the categorylinks and pagelinks tables in order
> > to meet these needs. I understand that the SQL dump format is painful
> > to deal with, but a plain CSV/TSV format should be fine. The mysql
> > client will create a solid TSV format if you just pipe a query to the
> > client and stream that out to a file (or better yet, to bzip and then
> > a file).
>
> We would need to 1) decide on an id space organization, 2) dump the data,
> translating it into that id space, and 3) build the graph.
>
> For 1) I think it would be good to have, like, spaces [0..x) [x..y), one
> for categories and one for pages. For 2) it's just a matter of a Java class
> fiddling with the ids and the titles (the format for links is asymmetrical).
> For 3) I'd love to release a binary compressed version because it takes
> much less space, it is immediately usable, and if you want to dump the
> pairs <x,y> in ASCII, it is just a single command line.
>
> > These two tables track the most recent categorization/link structure
> > of pages, so we wouldn't be able to use them historically.
>
> Someone previously asked for temporal data. How can we get access to
> that? We might provide a label file with on-off dates for every, say,
> category link.
>
> Ciao,
>
> seba
>
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>