On 10 Dec 2013, at 10:39 PM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:
Categories don't actually have ids. Categories
are more like tags in that they exist as soon as a page is "linked" to one.
Many categories have corresponding pages in the "Category" namespace that
describe them, but a category "exists" before a page is created.
OK. Let's say then that it would be nice to have a graph containing all
non-category pages and all categories. The links from/to the categories should probably be
just the category links (even if a category with a page might contain other linked
content).
The best thing, from a computational perspective, is that if n is the number of pages
plus the number of category pages, every page or category page is assigned a node
number in the interval [0..n).
Surely you can build a hash map on whatever unique identifier you like and get constant
(amortized) lookup speed.
FYI, that was before cuckoo hashing. We now have dictionaries with actual constant lookup
speed (not amortized).
I think it is best if we provide you with a raw format
that will work and you do your own post processing to obtain the "id space" that
you like.
It depends what you mean by "you". :) My idea is that it would be nice
to have a "gold standard" Wikipedia graph so that there is no postprocessing
done by the end user. This is in the interest of reproducibility--it's very easy with
large datasets to miss some trivial detail, and then things go wrong. I would be happy to
help make this process efficient and convenient so that it can be performed at
each dump.
That's also the purpose of deciding on a convenient ID space and then giving away a
bidirectional link map. The ids inside the SQL tables are artifacts of Wikipedia's
construction phases--we should hide them.
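
To make the ID-space idea concrete, here is a minimal Python sketch, not a
proposal for the actual dump tooling. It assumes that pages and categories are
identified by their titles; the function names (build_id_space, remap_links)
and the toy data are made up for illustration, and numbering nodes in
first-seen order is just one possible convention.

```python
def build_id_space(titles):
    """Assign each distinct title a node number in [0, n), in first-seen order.

    Returns (id_of, title_of): a dict title -> id and a list id -> title,
    which together form a two-way map between titles and node numbers.
    """
    id_of = {}
    title_of = []
    for title in titles:
        if title not in id_of:
            id_of[title] = len(title_of)
            title_of.append(title)
    return id_of, title_of


def remap_links(links, id_of):
    """Rewrite (source title, target title) pairs as (source id, target id)."""
    return [(id_of[src], id_of[dst]) for src, dst in links]


# Hypothetical toy data: two non-category pages and one category.
pages = ["Rome", "Milan"]
categories = ["Category:Cities in Italy"]
links = [
    ("Rome", "Category:Cities in Italy"),
    ("Milan", "Category:Cities in Italy"),
    ("Rome", "Milan"),
]

id_of, title_of = build_id_space(pages + categories)
arcs = remap_links(links, id_of)
```

With n = 3 nodes, every arc is a pair of integers in [0..n), and title_of
recovers the original title of any node, so the SQL ids never need to be
exposed to the end user.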
Ciao,
seba