Re: [Analytics] Distributing an official graph

11 Dec 2013

...
> And the ids of categories overlap with the ids of pages.

Categories don't actually have ids.  Categories are more like tags in that
they exist as soon as a page is "linked" to one.  Many categories have
corresponding pages in the "Category" namespace that describe them, but a
category "exists" before a page is created.

...
> The best thing, from a computational perspective, is that if n is the
> number of pages plus the number of category pages, every page or category
> page is assigned a node number in the interval [0..n).

Surely you can build a hash map on whatever unique identifier you like and
get constant (amortized) lookup speed.
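
A minimal sketch of that idea, assuming the graph arrives as a list of
(source, target) pairs keyed by whatever identifier the dump uses. The
function name and the sample ids are hypothetical; a category could just as
well be keyed by its title.

```python
def compact_ids(keys):
    """Assign each distinct key the next free integer in [0..n), in first-seen order."""
    index = {}
    for key in keys:
        if key not in index:
            # Dict lookup and insert are amortized constant time.
            index[key] = len(index)
    return index


# Hypothetical, non-contiguous page ids as they might appear in a dump.
links = [(12, 905), (905, 33), (12, 33)]

# Build the id -> node number map, then rewrite every edge with it.
node_of = compact_ids(pid for edge in links for pid in edge)
edges = [(node_of[src], node_of[dst]) for src, dst in links]
```

Any hashable key works here, so the same map can cover page_ids, titles, or
(page_namespace, page_title) pairs.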

I think it is best if we provide you with a raw format that will work and
you do your own post processing to obtain the "id space" that you like.
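
As a rough illustration of such post processing, here is a sketch that
numbers category titles first and then pages, each in lexicographic order,
and keeps the map in both directions. It assumes categories are keyed by
title and pages by (page_namespace, page_title); the function name, the
"category"/"page" tags and the sample data are made up for the example.

```python
def build_node_maps(category_titles, page_keys):
    """Return (nodes, node_of): node number -> key and key -> node number."""
    # Categories first, then pages, each sorted lexicographically.
    nodes = [("category", title) for title in sorted(set(category_titles))]
    nodes += [("page", key) for key in sorted(set(page_keys))]
    # The list index is the node number, so the list itself is the reverse map.
    node_of = {key: number for number, key in enumerate(nodes)}
    return nodes, node_of


# Hypothetical sample data; pages are (page_namespace, page_title) pairs.
categories = ["Physics", "Biology"]
pages = [(0, "Cell"), (0, "Atom"), (14, "Physics")]

nodes, node_of = build_node_maps(categories, pages)
```

With this, nodes[i] gives the key for node number i and node_of[key] gives
the node number, which is the title <-> node bridge back to the dumps.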

-Aaron

On Tue, Dec 10, 2013 at 10:09 PM, Sebastiano Vigna <vigna(a)di.unimi.it> wrote:

> On 10 Dec 2013, at 9:46 PM, Aaron Halfaker <ahalfaker(a)wikimedia.org>
> wrote:
>
> > I'm not sure what you are referring to when you say "id space".  A page
> > can be identified by its page_id or the pair: (page_namespace,
> > page_title).  A category can be identified by its title.  You could
> > enumerate them how you like post-hoc.
>
> The set of page ids in the SQL dump is not contiguous. And the ids of
> categories overlap with the ids of pages.
>
...
> The best thing, from a computational perspective, is that if n is the
> number of pages plus the number of category pages, every page or category
> page is assigned a node number in the interval [0..n). To make things easier
> (in particular, compression of strings) it might be useful to assign these
> node numbers by listing titles lexicographically, maybe first enumerating
> category titles and then enumerating page titles.
>
> My idea would be distributing a graph in binary compressed format with a
> compact id space [0..n), together with a bidirectional map title <-> node
> numbers. The map would be the bridge between the graph and the SQL
> dumps/Wikipedia text.
>
> > > Someone previously asked for temporal data. How can we get access to
> > > that?
> >
> > It doesn't exist.  We could start recording a history now, but without a
> > clear use-case, I'm not sure it's worth the time.
>
> I thought from previous comments that it was available--my mistake!
>
> Ciao,
>
>                                         seba
>
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
