On Thu, 19 Dec 2013 08:03:20 -0800
Sebastiano Vigna <vigna(a)di.unimi.it> wrote:
On 19 Dec 2013, at 4:28 AM, Johannes Kroll
<johannes.kroll(a)wikimedia.de> wrote:
Yes. In the link I posted in the mail you quoted,
there's an
example query including a set operation. The timing includes setting up
the connection, doing the two queries and the set operation, converting
the result to the line-based format, and transferring that over HTTP.
This is a real-world query, and about the same as you would get in a
tool that runs on Labs which uses CatGraph (minus the overhead from
starting the Curl binary, setting up the connection, and the slight
overhead from HTTP, because you would use plain TCP transfers in such a
tool). You can login to Tool Labs and try various queries yourself.
"Real-world" means little: the world is large--people have different needs.
Having the graph, embedded, with high-access speed is different than depending on a
service with a fixed number of primitives.
What if I want to know a centrality measure? Will you implement it for me?
If you want to do centrality measures on *current* data, you will either
have to implement it yourself (CatGraph is free software after all), or
wait for me to implement it.
If I have to fetch successor lists and compute it by
myself it will be 100-1000x slower. If I ask for a successor list, how much time per arc,
overall, will it take? This is the standard measure for the speed of a graph
representation. I can't evince anything from the example you quote.
I think you can. You can even run any query yourself. Try something
like:
curl
http://sylvester.wmflabs.org:8090/dewiki/traverse-successors+235276+9999 | head
You'll get something like:
OK. 102243 nodes, 0.160605s:
235276
338
464
1704
[...]
Now you can calculate your time per arc. This, again, includes setting
up the connection and converting the numbers to a line-based format for
transfer. So it isn't directly comparable to your numbers. And what
for, anyway? It doesn't make *any* difference to a user of a web tool
whether they wait for 0.01 seconds, or 0.09 seconds longer. I can
hardly justify spending any more time on optimizing this, since there
is no point.
You didn't mention how I could reproduce your 50ns number.
"Deliver
an edge in 50ns" sounds impressive, but this value doesn't
mean much without context. What does it mean?
Most basic graph traversal algorithms are linear in the number of arcs traversed. Thus, a
standard and informative measure of the speed of access to a graph (see any paper on the
subject) is how much time it takes to get a successor (say, from an iterator providing the
successors of a node). You don't need any context.
Note that it's you claiming that CatGraph is the service I need.
No, not at all. I saw this thread and said you might be interested, and
I wasn't addressing *you* or any single person in particular. I didn't
want to start a pissing contest either. We are providing a service that
enables tool authors and others to traverse graphs representing
*current* data from Wikipedia quickly and easily. We need an
implementation that scales well and can have its data constantly
updated, in a central place, without using too much memory while it
runs. You, on the other hand, want to do research on data that can be
months old. This is a different use case. But we already said that.
Both. Twice.
--
Johannes Kroll
Softwareentwickler
Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Tel. (030) 219 158 26-0
http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.