> This is in the interest of reproducibility--it's very easy with large datasets to miss some trivial detail and then things run amok.

Yes.  It would be nice if we didn't have to manage such details. 

> The ids inside the SQL tables are artifacts of Wikipedia's construction phases--we should hide them.

I don't think that hiding IDs that could be used to, say, look up content or compare against user contrib histories is a good idea. 

-Aaron


On Tue, Dec 10, 2013 at 10:48 PM, Sebastiano Vigna <vigna@di.unimi.it> wrote:
On 10 Dec 2013, at 10:39 PM, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:

> Categories don't actually have ids.  Categories are more like tags in that they exist as soon as a page is "linked" to one.  Many categories have corresponding pages in the "Category" namespace that describe them, but a category "exists" before a page is created.

OK. Let's say then that it would be nice to have a graph containing all non-category pages and all categories. Probably the links from/to the categories should be just the category links (even if a category with a page might contain other linked content).

> > The best thing, from a computational perspective, is that, if n is the number of pages plus the number of category pages, every page or category page is assigned a node number in the interval [0..n).
>
> Surely you can build a hash map on whatever unique identifier you like and get constant (amortized) lookup speed.

FYI, that was before cuckoo hashing. We now have dictionaries with worst-case constant lookup time (not amortized).
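
To illustrate the point about lookups, here is a minimal sketch of a cuckoo-hashed set. It is not taken from any existing library: the class name CuckooSet, the salted use of Python's built-in hash(), and the limit of 32 displacements are all assumptions made for the example. What it shows is that a lookup inspects at most two slots, while insertions may displace keys or trigger a rebuild.

```python
import random


class CuckooSet:
    """Illustrative cuckoo-hashed set (hypothetical, not a library API)."""

    MAX_DISPLACEMENTS = 32  # assumed limit before giving up and rebuilding

    def __init__(self, capacity=16, seed=0):
        self._rng = random.Random(seed)
        self._capacity = capacity
        self._new_tables()

    def _new_tables(self):
        # Two tables, each with its own salt acting as a separate hash function.
        self._salts = (self._rng.getrandbits(64), self._rng.getrandbits(64))
        self._tables = [[None] * self._capacity, [None] * self._capacity]

    def _slot(self, which, key):
        return hash((self._salts[which], key)) % self._capacity

    def __contains__(self, key):
        # A lookup probes exactly one slot per table, whatever the set size.
        return any(self._tables[w][self._slot(w, key)] == key for w in (0, 1))

    def add(self, key):
        if key in self:
            return
        which = 0
        for _ in range(self.MAX_DISPLACEMENTS):
            slot = self._slot(which, key)
            # Place the key and pick up whatever occupied the slot before.
            key, self._tables[which][slot] = self._tables[which][slot], key
            if key is None:
                return
            which = 1 - which  # the evicted key tries its slot in the other table
        self._rehash(key)

    def _rehash(self, pending):
        # Too many displacements: double the tables, draw new salts, reinsert.
        keys = [k for table in self._tables for k in table if k is not None]
        keys.append(pending)
        self._capacity *= 2
        self._new_tables()
        for k in keys:
            self.add(k)


titles = CuckooSet(capacity=4)
for i in range(200):
    titles.add(f"Page {i}")
```

Insertion here is only expected constant time (a rebuild costs linear time); it is the lookup that has the worst-case bound.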

> I think it is best if we provide you with a raw format that will work and you do your own post processing to obtain the "id space" that you like.

It depends on what you mean by "you". :) My idea is that it would be nice to have a "gold standard" Wikipedia graph so that no postprocessing is left to the end user. This is in the interest of reproducibility--it's very easy with large datasets to miss some trivial detail and then things run amok. I would be happy to help make this process efficient and convenient so that it can be performed at each dump.

That's also the purpose of deciding on a convenient ID space and then giving away a bidirectional link map. The ids inside the SQL tables are artifacts of Wikipedia's construction phases--we should hide them.
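
As a rough sketch of what such an ID space could look like, the snippet below numbers pages and categories densely in [0..n) and keeps the mapping in both directions. Everything in it is an assumption made for illustration: the function names build_node_map and remap_links, the ("page", title) / ("category", name) key format, and the choice of sorted order to make the numbering reproducible across runs. It also reads "bidirectional map" as a map between node numbers and the original page or category names, which is only one possible interpretation.

```python
def build_node_map(page_titles, category_names):
    """Assign every page and category a node number in [0..n).

    Returns (keys, node_of): keys[i] is the item numbered i, and
    node_of[key] is the number of that item, so the map is bidirectional.
    """
    # Sorting makes the numbering depend only on the input sets,
    # not on the order in which a dump happens to list them.
    keys = [("page", title) for title in sorted(set(page_titles))]
    keys += [("category", name) for name in sorted(set(category_names))]
    node_of = {key: number for number, key in enumerate(keys)}
    return keys, node_of


def remap_links(links, node_of):
    """Translate (source_key, target_key) links into sorted pairs of node numbers."""
    return sorted({(node_of[source], node_of[target]) for source, target in links})


keys, node_of = build_node_map(["Rome", "Milan"], ["Cities in Italy"])
arcs = remap_links(
    [
        (("page", "Rome"), ("category", "Cities in Italy")),
        (("page", "Milan"), ("category", "Cities in Italy")),
        (("page", "Rome"), ("page", "Milan")),
    ],
    node_of,
)
```

Distributing keys alongside the arcs would let anyone go back from a node number to the page or category it stands for without touching the SQL ids.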

Ciao,

                                        seba


_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics