And the ids of categories overlap with the ids of pages.

Categories don't actually have ids.  Categories are more like tags in that they exist as soon as a page is "linked" to one.  Many categories have corresponding pages in the "Category" namespace that describe them, but a category "exists" as soon as it is used, even before its description page is created.

The best thing, from a computational perspective, is that, if n is the number of pages plus the number of category pages, every page or category page is assigned a node number in the interval [0..n).

Surely you can build a hash map on whatever unique identifier you like and get constant (amortized) lookup speed.  

I think it is best if we provide you with a raw format that will work and you do your own post processing to obtain the "id space" that you like.
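As a rough illustration of that kind of post-processing, here is a minimal Python sketch that maps sparse identifiers onto the interval [0..n) with an ordinary hash map. The id values, the edge list, and the function names are made up for the example and are not taken from any actual dump.

```python
def build_dense_ids(raw_ids):
    """Map arbitrary unique identifiers to node numbers 0..n-1 (sorted order)."""
    unique_sorted = sorted(set(raw_ids))
    return {raw: node for node, raw in enumerate(unique_sorted)}


def remap_edges(edges, dense):
    """Rewrite (source, target) pairs of raw ids as pairs of node numbers."""
    return [(dense[src], dense[dst]) for src, dst in edges]


# Hypothetical, non-contiguous ids and links, invented for this example.
page_ids = [12, 25, 39, 290, 303]
links = [(12, 290), (303, 25)]

dense = build_dense_ids(page_ids)        # dict lookups are amortized O(1)
dense_links = remap_edges(links, dense)  # edges expressed in the [0..n) id space
```

Sorting before enumerating only makes the assignment reproducible; any fixed order would give a valid id space.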

-Aaron


On Tue, Dec 10, 2013 at 10:09 PM, Sebastiano Vigna <vigna@di.unimi.it> wrote:
On 10 Dec 2013, at 9:46 PM, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:

> I'm not sure what you are referring to when you say "id space".  A page can be identified by its page_id or the pair: (page_namespace, page_title).  A category can be identified by its title.  You could enumerate them how you like post-hoc.

The set of page ids in the SQL dump is not contiguous. And the ids of categories overlap with the ids of pages.

The best thing, from a computational perspective, is that, if n is the number of pages plus the number of category pages, every page or category page is assigned a node number in the interval [0..n). To make things easier (in particular, compression of strings) it might be useful to assign these node numbers by listing titles lexicographically, maybe first enumerating category titles and then enumerating page titles.

My idea would be distributing a graph in binary compressed format with a compact id space [0..n), together with a bidirectional map title <-> node numbers. The map would be the bridge between the graph and the SQL dumps/Wikipedia text.
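A minimal Python sketch of the numbering and the bidirectional map, only to make the idea concrete: the titles below are invented, the function name is hypothetical, and Python's default string ordering (by code point) stands in for whatever lexicographic order would actually be chosen. Each node is keyed by a (kind, title) pair so that a category and a page with the same title stay distinct.

```python
def build_title_maps(category_titles, page_titles):
    """Number categories first, then pages, each group in sorted title order.

    Returns (node_to_key, key_to_node): a list indexed by node number and
    the inverse dictionary, which together form the bidirectional map.
    """
    node_to_key = [("category", t) for t in sorted(set(category_titles))]
    node_to_key += [("page", t) for t in sorted(set(page_titles))]
    key_to_node = {key: node for node, key in enumerate(node_to_key)}
    return node_to_key, key_to_node


# Hypothetical titles, invented for this example.
categories = ["Physics", "Art"]
pages = ["Zebra", "Albert Einstein", "Art"]

node_to_key, key_to_node = build_title_maps(categories, pages)
n = len(node_to_key)  # every node number lies in [0..n)
```

Here node_to_key[i] gives the title for node i, and key_to_node[("page", "Art")] goes the other way, which is the bridge back to the SQL dumps.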

> > Someone previously asked for temporal data. How can we get access to that?
>
> It doesn't exist.  We could start recording a history now, but without a clear use-case, I'm not sure it's worth the time.

I thought from previous comments that it was available--my mistake!

Ciao,

                                        seba


_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics