On 17 Dec 2013, at 9:02 AM, Johannes Kroll <johannes.kroll@wikimedia.de> wrote:
> We store page_ids only, or any other integer IDs. Tools using it
> fetch all other data from SQL. This makes sense for Tools on Labs,
> for example, which have access to the DB replica anyway. We don't
> compress anything, which makes it quite fast.
Compression and speed are not at odds--quite the contrary. A standard
compression format from WebGraph delivers an edge in ~50ns. Frankly, any
service will require orders of magnitude more. Do you have any timings to
compare?
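For concreteness, here is a rough sketch of the kind of measurement I
mean, using WebGraph's sequential node iterator; the graph basename is
whatever you pass on the command line, and the ns/edge figure will of
course depend on your hardware and on the graph:

    import it.unimi.dsi.webgraph.ImmutableGraph;
    import it.unimi.dsi.webgraph.LazyIntIterator;
    import it.unimi.dsi.webgraph.NodeIterator;

    public class EdgeSpeed {
        public static void main(String[] args) throws Exception {
            // Load a BV-compressed graph by basename (args[0]).
            ImmutableGraph graph = ImmutableGraph.load(args[0]);
            long edges = 0;
            final long start = System.nanoTime();
            // Sequential scan: enumerate the successors of every node.
            final NodeIterator nodes = graph.nodeIterator();
            while (nodes.hasNext()) {
                nodes.nextInt();
                final LazyIntIterator successors = nodes.successors();
                while (successors.nextInt() != -1) edges++;
            }
            final long elapsed = System.nanoTime() - start;
            System.out.printf("%d edges in %.2f s (%.1f ns/edge)%n",
                    edges, elapsed / 1E9, (double) elapsed / edges);
        }
    }

Dividing total elapsed time by the number of edges enumerated gives the
per-edge cost directly; the same harness around your service's calls
would make the two numbers comparable.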
> It isn't a goal; the service already exists. The data you get is
> fresh, automatically updated every hour or so, unlike a graph that
> you would
This is exactly what you don't want for research purposes: moving targets.
You need a dataset, downloaded at some point in time (like the Wikipedia
dumps), that other people can use to replicate your results or improve on
them. Anything that is updated every hour is unusable for that purpose.
It's just a different goal.

Once you nail down your algorithms it might, of course, be a good idea to
run them on fresh data, but research requires replicability.
> download. It's as easy to use as any other software library that you
> pull into your script with "import foo". As to speed, most results
> are pretty much instant. Try it:
"Instant" has for me no meaning. Can you quantify?
Ciao,
seba