> And the ids of categories overlap with the ids of pages.
Categories don't actually have ids. Categories are more like tags in that
they exist as soon as a page is "linked" to one. Many categories have
corresponding pages in the "Category" namespace that describe them, but a
category "exists" before a page is created.
> The best thing, from a computational perspective, is that if n is the
> number of pages plus the number of category pages every page or category
> page is assigned a node number in the interval [0..n).
Surely you can build a hash map on whatever unique identifier you like and
get constant (amortized) lookup speed.
I think it is best if we provide you with a raw format that will work and
you do your own post processing to obtain the "id space" that you like.
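
A rough sketch of that kind of post-processing, assuming you have already
extracted the category titles and the (page_namespace, page_title) pairs
from the dump. The function name and the sample titles are made up for
illustration; the ordering (categories first, then pages, each sorted
lexicographically) follows the suggestion quoted below.

```python
def build_node_ids(category_titles, page_keys):
    """Assign contiguous node numbers in [0, n).

    category_titles: iterable of category title strings
    page_keys: iterable of (page_namespace, page_title) pairs
    Both inputs are assumed to come from your own parsing of the dump.
    """
    # Categories first, then pages, each group sorted lexicographically.
    nodes = [("category", title) for title in sorted(set(category_titles))]
    nodes += [("page", key) for key in sorted(set(page_keys))]
    # Forward map: hash map from identifier to node number.
    # Reverse map: the list itself, indexed by node number.
    node_of = {key: i for i, key in enumerate(nodes)}
    return node_of, nodes


# Hypothetical sample data; namespace 14 is the "Category" namespace.
node_of, key_of = build_node_ids(
    ["Physics", "Biology"],
    [(0, "Albert Einstein"), (14, "Physics"), (0, "Cell")],
)
print(node_of[("category", "Physics")])
print(key_of[4])
```

Tagging each key with "category" or "page" keeps a category and its
description page in the "Category" namespace as two distinct nodes.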
-Aaron
On Tue, Dec 10, 2013 at 10:09 PM, Sebastiano Vigna <vigna(a)di.unimi.it> wrote:
> On 10 Dec 2013, at 9:46 PM, Aaron Halfaker <ahalfaker(a)wikimedia.org>
> wrote:
>
> > I'm not sure what you are referring to when you say "id space". A page
> > can be identified by its page_id or the pair: (page_namespace,
> > page_title). A category can be identified by its title. You could
> > enumerate them how you like post-hoc.
>
> The set of page ids in the SQL dump is not contiguous. And the ids of
> categories overlap with the ids of pages.
>
> The best thing, from a computational perspective, is that if n is the
> number of pages plus the number of category pages every page or category
> page is assigned a node number in the interval [0..n). To make things easier
> (in particular, compression of strings) it might be useful to assign these
> node numbers by listing titles lexicographically, maybe first enumerating
> category titles and then enumerating page titles.
>
> My idea would be distributing a graph in binary compressed format with a
> compact id space [0..n), together with a bidirectional map title <-> node
> numbers. The map would be the bridge between the graph and the SQL
> dumps/Wikipedia text.
>
> > > Someone previously asked for temporal data. How can we get access to
> > > that?
> >
> > It doesn't exist. We could start recording a history now, but without a
> > clear use-case, I'm not sure it's worth the time.
>
> I thought from previous comments that it was available--my mistake!
>
> Ciao,
>
> seba
>
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>