The pagelinks table doesn’t record link creation history, but categorylinks does
include a timestamp of the most recent change for each individual record [1].
Dario
[1]
On Dec 10, 2013, at 8:09 PM, Sebastiano Vigna <vigna(a)di.unimi.it> wrote:
On 10 Dec 2013, at 9:46 PM, Aaron Halfaker
<ahalfaker(a)wikimedia.org> wrote:
I'm not sure what you are referring to when you say "id space". A page can be
identified by its page_id or by the pair (page_namespace, page_title). A category
can be identified by its title. You could enumerate them how you like post-hoc.
The set of page ids in the SQL dump is not contiguous. And the ids of categories overlap
with the ids of pages.
The best thing, from a computational perspective, is that if n is the number of pages
plus the number of category pages, every page or category page is assigned a node number in
the interval [0..n). To make things easier (in particular, compression of strings), it might
be useful to assign these node numbers by listing titles lexicographically, maybe first
enumerating category titles and then enumerating page titles.
My idea would be distributing a graph in binary compressed format with a compact id space
[0..n), together with a bidirectional map title <-> node numbers. The map would be
the bridge between the graph and the SQL dumps/Wikipedia text.
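
A minimal sketch of the numbering described above, assuming the category titles, page titles, and links have already been extracted from the SQL dumps. The function names and the sample titles are hypothetical, and Python's default string ordering (by code point) stands in for "lexicographic"; this is only an illustration, not the actual format of any released dataset.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (kind, title), where kind is "category" or "page"


def build_node_map(category_titles: Iterable[str],
                   page_titles: Iterable[str]) -> Tuple[List[Key], Dict[Key, int]]:
    """Assign node numbers in [0..n): sorted categories first, then sorted pages."""
    # Keys carry the kind because a category and a page may share a title.
    title_of: List[Key] = [("category", t) for t in sorted(set(category_titles))]
    title_of += [("page", t) for t in sorted(set(page_titles))]
    # The list maps node number -> title; the dict maps title -> node number.
    node_of: Dict[Key, int] = {key: i for i, key in enumerate(title_of)}
    return title_of, node_of


def remap_edges(edges: Iterable[Tuple[Key, Key]],
                node_of: Dict[Key, int]) -> List[Tuple[int, int]]:
    """Rewrite title-based links as sorted pairs of compact node numbers."""
    return sorted((node_of[src], node_of[dst]) for src, dst in edges)


# Hypothetical sample data, for illustration only.
categories = ["Physics", "Algebra"]
pages = ["Zorn's lemma", "Algebra", "Isaac Newton"]
links = [
    (("page", "Zorn's lemma"), ("page", "Algebra")),
    (("page", "Isaac Newton"), ("category", "Physics")),
]

title_of, node_of = build_node_map(categories, pages)
edges = remap_edges(links, node_of)
```

Here `title_of` and `node_of` together play the role of the bidirectional map: the graph itself only needs the integer pairs in `edges`, while the map links node numbers back to titles in the dumps.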
Someone previously asked for temporal data. How can we get access to that?
It doesn't exist. We could start recording a history now, but without a clear
use-case, I'm not sure it's worth the time.
I thought from previous comments that it was available--my mistake!
Ciao,
seba
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics