On 17 Dec 2013, at 6:01 AM, Johannes Kroll <johannes.kroll(a)wikimedia.de> wrote:
> Not sure how practical it would be to put this all into one graph. In
> CatGraph we have one instance for the category/page structure of each
> supported language. These live in separate processes each. They talk
> to a server process which talks to clients. The largest single graph
> has about 76 million arcs (enwiki category links). Currently there is
> one server process on one host, but we could distribute that to several
> hosts when we need to, e.g. when we need more RAM. This approach scales
> better than one single in-memory graph for *everything*.
Well, we routinely handle, in RAM on a laptop, graphs two orders of magnitude larger. The
point is which graph representation you use: we developed compressed representations that
reduce the in-memory size of a graph by an order of magnitude. If you have a look at
http://law.di.unimi.it/webdata/enwiki-2013/, you'll see that the whole English Wikipedia (no
templates) is 159 MB. With all categories and category links it is 230 MB. Lists of titles in
lexicographical order compress very well using prefix omission.
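To make the prefix-omission point concrete, here is a minimal sketch (not WebGraph's actual code) of front coding: each title in a sorted list is stored as the length of the prefix it shares with the previous title, plus the remaining suffix.

```python
# Front coding (prefix omission) for a lexicographically sorted title list.
# Illustrative sketch only; WebGraph's actual encoding differs in detail.

def front_code(titles):
    """Encode sorted titles as (shared-prefix length, suffix) pairs."""
    encoded, prev = [], ""
    for t in titles:
        # Length of the prefix shared with the previous title.
        lcp = 0
        while lcp < min(len(prev), len(t)) and prev[lcp] == t[lcp]:
            lcp += 1
        encoded.append((lcp, t[lcp:]))
        prev = t
    return encoded

def front_decode(encoded):
    """Recover the original title list."""
    titles, prev = [], ""
    for lcp, suffix in encoded:
        t = prev[:lcp] + suffix
        titles.append(t)
        prev = t
    return titles

titles = ["Category:Physics", "Category:Physics_stubs", "Category:Physiology"]
enc = front_code(titles)
# Only the suffixes "_stubs" and "ology" are stored for the later titles.
assert front_decode(enc) == titles
```

Because adjacent sorted titles share long prefixes (e.g. the "Category:" namespace), most of each string is elided.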
Moreover, our proposal is to distribute an embedded, easy-to-use graph. Load it into memory,
access the successor lists, and that's it. Setting up a service is a more complex
goal: it is more complex to use, and it gives slower access. It's like an SQL server
vs. an embedded BerkeleyDB database.
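The "embedded graph" idea above can be sketched like this: the whole graph sits in two in-process arrays (a CSR-style layout), and reading a node's successor list is just a slice, with no server round-trip. The class and method names here are illustrative, not WebGraph's actual API.

```python
# Minimal sketch of an embedded in-memory graph (CSR adjacency layout).
# Hypothetical names; not the WebGraph library's API.
from array import array

class EmbeddedGraph:
    def __init__(self, edges, num_nodes):
        # offsets[u] .. offsets[u+1] delimits u's successors in `succ`.
        counts = [0] * num_nodes
        for u, _ in edges:
            counts[u] += 1
        self.offsets = array("l", [0] * (num_nodes + 1))
        for u in range(num_nodes):
            self.offsets[u + 1] = self.offsets[u] + counts[u]
        self.succ = array("l", [0] * len(edges))
        pos = list(self.offsets[:num_nodes])
        for u, v in edges:
            self.succ[pos[u]] = v
            pos[u] += 1

    def successors(self, u):
        """Successor list of u: a plain array slice, no I/O."""
        return self.succ[self.offsets[u]:self.offsets[u + 1]]

g = EmbeddedGraph([(0, 1), (0, 2), (1, 2)], num_nodes=3)
assert list(g.successors(0)) == [1, 2]
```

The contrast with a service architecture is exactly the one in the email: here every lookup is a local memory access, whereas CatGraph's design pays for a client/server hop on each query in exchange for centralizing the RAM.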
Ciao,
seba