On Tue, 17 Dec 2013 07:57:59 -0800
Sebastiano Vigna <vigna(a)di.unimi.it> wrote:
On 17 Dec 2013, at 6:01 AM, Johannes Kroll
<johannes.kroll(a)wikimedia.de> wrote:
Not sure how practical it would be to put this all into one graph. In
CatGraph we have one instance for the category/page structure of each
supported language. Each instance lives in its own process. They talk
to a server process which talks to clients. The largest single graph
has about 76 million arcs (enwiki category links). Currently there is
one server process on one host, but we could distribute that to several
hosts when we need to, e.g. when we need more RAM. This approach scales
better than one single in-memory graph for *everything*.
Well, we routinely handle graphs two orders of magnitude larger in RAM on a laptop. The
point is your graph representation: we developed compressed representations that reduce
the size of the graph in memory by an order of magnitude. If you have a look at
http://law.di.unimi.it/webdata/enwiki-2013/, you'll see that all of English Wikipedia (no
templates) is 159 MB. With all categories and category links it is 230 MB. Lists of titles
in lexicographical order compress very well using prefix omission.
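To illustrate the prefix omission mentioned above, here is a minimal sketch in Python. It is not the code behind the enwiki-2013 dataset; the function names `front_encode` and `front_decode` and the sample titles are hypothetical. Each title in a sorted list is stored as the number of leading characters it shares with the previous title, plus the remaining suffix.

```python
import os


def front_encode(sorted_titles):
    """Encode a sorted list of titles as (shared_prefix_length, suffix) pairs."""
    encoded = []
    previous = ""
    for title in sorted_titles:
        # Number of leading characters shared with the previous title.
        shared = len(os.path.commonprefix([previous, title]))
        encoded.append((shared, title[shared:]))
        previous = title
    return encoded


def front_decode(encoded):
    """Rebuild the original titles from (shared_prefix_length, suffix) pairs."""
    titles = []
    previous = ""
    for shared, suffix in encoded:
        title = previous[:shared] + suffix
        titles.append(title)
        previous = title
    return titles


# Hypothetical sample data; real title lists are far longer.
titles = sorted(["Graph", "Graph theory", "Graph traversal", "Graphics"])
encoded = front_encode(titles)
decoded = front_decode(encoded)
```

Because neighbouring titles in lexicographical order tend to share long prefixes, most entries shrink to a small integer and a short suffix. A practical implementation would also restart the encoding every few entries so that a single title can be looked up without decoding the whole list; that is omitted here.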
We store page_ids only, or any other integer IDs. Tools using it
fetch all other data from SQL. This makes sense for Tools on Labs for
example, which have access to the DB replica anyway. We don't compress
anything, which makes it quite fast.
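As a rough illustration of what an uncompressed graph of integer IDs can look like, here is a sketch in Python. It is not CatGraph's actual implementation or API; the class `IntGraph`, its methods, and the sample page IDs are assumptions made for this example. Arcs are kept in flat integer arrays, and the successors of a page are a contiguous slice.

```python
from array import array
from bisect import bisect_left


class IntGraph:
    """Uncompressed directed graph over integer IDs (hypothetical sketch)."""

    def __init__(self, arcs):
        # Sort arcs by (source, target) so each node's successors are contiguous.
        arcs = sorted(set(arcs))
        self.nodes = array("q", sorted({n for arc in arcs for n in arc}))
        self.targets = array("q", (target for _, target in arcs))
        # offsets[i]..offsets[i + 1] delimits the successors of nodes[i].
        self.offsets = array("q", [0] * (len(self.nodes) + 1))
        for source, _ in arcs:
            self.offsets[self._index(source) + 1] += 1
        for i in range(len(self.nodes)):
            self.offsets[i + 1] += self.offsets[i]

    def _index(self, page_id):
        # Binary search for the position of page_id in the sorted node array.
        i = bisect_left(self.nodes, page_id)
        if i == len(self.nodes) or self.nodes[i] != page_id:
            raise KeyError(page_id)
        return i

    def successors(self, page_id):
        i = self._index(page_id)
        return list(self.targets[self.offsets[i]:self.offsets[i + 1]])


# Hypothetical page IDs: page 10 links to 20 and 30, page 20 links to 30.
graph = IntGraph([(10, 20), (10, 30), (20, 30)])
```

With 8-byte entries, as in this sketch, 76 million arcs would take roughly 600 MB for the target array alone; 4-byte IDs would halve that. This is a back-of-the-envelope figure under those assumptions, not a measurement of CatGraph.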
Moreover, our proposal is to distribute an embedded,
easy-to-use graph. Load it in memory and access successors lists, and that's it.
Setting up a service is a more complex goal; it is more complex to use and it gives slower
access.
It isn't a goal; the service already exists. The data you get is fresh,
automatically updated every hour or so, unlike a graph that you would
download. It's as easy to use as any other software library that you
pull into your script with "import foo". As to speed, most results are
pretty much instant. Try it:
https://wikitech.wikimedia.org/wiki/Nova_Resource:Catgraph