On 10 Dec 2013, at 10:39 PM, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:
> Categories don't actually have ids. Categories are more like tags in that they exist as soon as a page is "linked" to one. Many categories have corresponding pages in the "Category" namespace that describe them, but a category "exists" before a page is created.
OK. Let's say then that it would be nice to have a graph containing all non-category pages and all categories. The links from/to the categories should probably be just the category links (even if a category that has a page might contain other linked content).
> > The best thing, from a computational perspective, is that if n is the number of pages plus the number of category pages every page or category page is assigned a node number in the interval [0..n).
>
> Surely you can build a hash map on whatever unique identifier you like and get constant (amortized) lookup speed.
FYI, that was before cuckoo hashing. We now have dictionaries with truly constant lookup time (worst case, not amortized).
> I think it is best if we provide you with a raw format that will work and you do your own post processing to obtain the "id space" that you like.
It depends on what you mean by "you". :) My idea is that it would be nice to have a "gold standard" Wikipedia graph, so that no postprocessing is done by the end user. This is in the interest of reproducibility--with large datasets it's very easy to miss some trivial detail, and then things run amok. I would be happy to help make this process efficient and convenient so that it can be performed at each dump.
That's also the purpose of deciding on a convenient ID space and then giving away a bidirectional link map. The ids inside the SQL tables are artifacts of Wikipedia's construction phases--we should hide them.
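As a rough illustration of what such an ID space could look like, here is a sketch that numbers pages and categories in [0..n) and returns the map in both directions. The function name, the tuple keys and the sort-based ordering are all assumptions made for the example, not an agreed format:

```python
def assign_node_ids(page_titles, category_names):
    """Assign every page and category a node number in [0..n).

    Keys are ("page", title) or ("category", name); sorting makes the
    numbering deterministic, so two runs on the same dump agree.
    """
    keys = sorted(("page", t) for t in set(page_titles))
    keys += sorted(("category", c) for c in set(category_names))
    id_to_key = keys                                   # node number -> key
    key_to_id = {k: i for i, k in enumerate(keys)}     # key -> node number
    return key_to_id, id_to_key
```

With dense numbers the reverse direction is a plain list lookup, and the SQL-table ids never appear in the published graph.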
Ciao,
seba
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics