dataset to miss some trivial detail and then things go amok
Yes. It would be nice if we didn't have to manage such details.
phases--we should hide them.
I don't think that hiding IDs that could be used to, say, look up content
or compare against user contrib histories is a good idea.
-Aaron
On Tue, Dec 10, 2013 at 10:48 PM, Sebastiano Vigna <vigna(a)di.unimi.it> wrote:
On 10 Dec 2013, at 10:39 PM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:
Categories don't actually have ids. Categories are more like tags in that
they exist as soon as a page is "linked" to one. Many categories have
corresponding pages in the "Category" namespace that describe them, but a
category "exists" before a page is created.
OK. Let's say then that it would be nice to have a graph containing all
non-category pages and all categories. Probably the links from/to the
categories should be just the category links (even if a category with a page
might contain other linked content).
> The best thing, from a computational perspective, is that, if n is the
number of pages plus the number of category pages, every page or category
page is assigned a node number in the interval [0..n).
Surely you can build a hash map on whatever unique identifier you like and
get constant (amortized) lookup speed.
FYI, that was before cuckoo hashing. We now have dictionaries with actual
constant lookup speed (not amortized).
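To make the cuckoo hashing remark concrete, here is a toy Python sketch. It is not a production dictionary and not necessarily the structure meant above. The class name `CuckooMap`, the seeded use of Python's built-in `hash`, and the grow-and-rehash policy are all assumptions made for illustration. The point it shows is that a lookup probes at most two slots in the worst case, while insertion may evict entries and occasionally rebuild the tables.

```python
import random


class CuckooMap:
    """Toy cuckoo hash map: each key lives in one of two candidate slots."""

    def __init__(self, capacity=8):
        self._capacity = capacity
        self._new_tables()

    def _new_tables(self):
        # Fresh seeds give two fresh hash functions after every rebuild.
        self._seeds = (random.getrandbits(64), random.getrandbits(64))
        self._tables = [[None] * self._capacity, [None] * self._capacity]

    def _slot(self, which, key):
        return hash((self._seeds[which], key)) % self._capacity

    def get(self, key, default=None):
        # Worst case: two probes, one per table.
        for which in (0, 1):
            entry = self._tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return default

    def put(self, key, value):
        # Overwrite in place if the key is already stored.
        for which in (0, 1):
            i = self._slot(which, key)
            entry = self._tables[which][i]
            if entry is not None and entry[0] == key:
                self._tables[which][i] = (key, value)
                return
        entry, which = (key, value), 0
        for _ in range(2 * self._capacity):
            i = self._slot(which, entry[0])
            # Place the entry and pick up whatever occupied the slot.
            entry, self._tables[which][i] = self._tables[which][i], entry
            if entry is None:
                return
            which = 1 - which
        # Too many evictions: grow the tables and reinsert everything.
        self._rehash(entry)

    def _rehash(self, pending):
        items = [e for table in self._tables for e in table if e is not None]
        items.append(pending)
        self._capacity *= 2
        self._new_tables()
        for k, v in items:
            self.put(k, v)


m = CuckooMap()
for node, title in enumerate(["Rome", "Milan", "Category:Cities in Italy"]):
    m.put(title, node)
```

Here `m.get("Milan")` inspects at most two slots, however many keys are stored; only `put` pays for evictions and rebuilds.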
I think it is best if we provide you with a raw format that will work and
you do your own post processing to obtain the "id space" that you like.
It depends on what you mean by "you". :) My idea is that it would have been
nice to have a "gold standard" Wikipedia graph so that there is no
postprocessing done by the end user. This is in the interest of
reproducibility--it's very easy with a large dataset to miss some trivial
detail and then things go amok. I would be happy to help make this
process efficient and convenient so that it can be performed at each
dump.
That's also the purpose of deciding on a convenient ID space and then giving
away a bidirectional link map. The ids inside the SQL tables are artifacts
of Wikipedia's construction phases--we should hide them.
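A minimal Python sketch of such an ID space follows, under stated assumptions: the function name `build_id_space` and its inputs are hypothetical, and sorting the titles is just one possible way to make the numbering deterministic across runs. Every page and every category gets a node number in [0..n), and the returned pair of structures gives the mapping in both directions.

```python
def build_id_space(page_titles, category_names):
    """Assign each page and each category a node number in [0, n)."""
    # Sorting makes the numbering depend only on the input set, not its order.
    keys = [("page", title) for title in sorted(set(page_titles))]
    keys += [("category", name) for name in sorted(set(category_names))]
    node_of = {key: node for node, key in enumerate(keys)}
    # keys[node] is the inverse mapping, from node number back to the key.
    return node_of, keys


node_of, key_of = build_id_space(
    page_titles=["Rome", "Milan"],
    category_names=["Cities in Italy"],
)
```

With this input, `node_of[("page", "Milan")]` is 0 and `key_of[2]` is `("category", "Cities in Italy")`. Publishing both structures alongside the graph would let users translate between node numbers and titles without touching the SQL ids.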
Ciao,
seba
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics