Re: [Analytics] Distributing an official graph

19 Dec 2013

On Tue, 17 Dec 2013 09:09:08 -0800
Sebastiano Vigna <vigna(a)di.unimi.it> wrote:

...
> On 17 Dec 2013, at 9:02 AM, Johannes Kroll
> <johannes.kroll(a)wikimedia.de> wrote:
>
> > We store page_ids only, or any other integer IDs. Tools using it
> > fetch all other data from SQL. This makes sense for Tools on Labs for
> > example, which have access to the DB replica anyway. We don't compress
> > anything, which makes it quite fast.
>
> Compression and speed are not at odds with each other; quite the contrary.
> A standard compression format from WebGraph delivers an edge in ~50ns.
> Frankly, any service will require orders of magnitude more. Do you have
> any timings to compare?

Yes. In the link I posted in the mail you quoted, there's an
example query including a set operation. The timing includes setting up
the connection, doing the two queries and the set operation, converting
the result to the line-based format, and transferring that over HTTP.
This is a real-world query, and about the same as you would get in a
tool that runs on Labs which uses CatGraph (minus the overhead from
starting the curl binary, setting up the connection, and the slight
overhead from HTTP, because you would use plain TCP transfers in such a
tool). You can log in to Tool Labs and try various queries yourself.

"Deliver an edge in 50ns" sounds impressive, but this value doesn't
mean much without context. What does it mean?

...
> > It isn't a goal, the service already exists. The data you get is fresh,
> > automatically updated every hour or so, unlike a graph that you would
>
> This is exactly what you don't want for research purposes: moving targets.
> You need a dataset, downloaded at some point in time (like Wikipedia dumps),
> that other people can use to replicate our results or improve them. Anything
> that is updated every hour is unusable for that purpose. It's just a
> different goal.
>
> Once you nail down your algorithms it might be, of course, a good idea to
> run them on fresh data, but research requires replicability.

If you need information about the current state of Wikipedia, anything
that doesn't reflect the current state of Wikipedia is simply not
useful. That's the case for most maintenance tools for example. 

So yes, it is a different use case.

-- 
Johannes Kroll
Softwareentwickler

Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Tel. (030) 219 158 26-0

http://wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.
