On 17 Dec 2013, at 6:01 AM, Johannes Kroll <johannes.kroll(a)wikimedia.de> wrote:
> Not sure how practical it would be to put this all into one graph. In
> CatGraph we have one instance for the category/page structure of each
> supported language. These live in separate processes each. They talk
> to a server process which talks to clients. The largest single graph
> has about 76 million arcs (enwiki category links). Currently there is
> one server process on one host, but we could distribute that to several
> hosts when we need to, e.g. when we need more RAM. This approach scales
> better than one single in-memory graph for *everything*.
Well, we routinely handle, in RAM on a laptop, graphs two orders of magnitude larger. The
point is which graph representation you use: we developed compressed representations that
reduce the in-memory size of a graph by an order of magnitude. If you have a look at
http://law.di.unimi.it/webdata/enwiki-2013/, you'll see that the whole English Wikipedia (no
templates) is 159 MB. With all categories and category links it is 230 MB. Lists of titles in
lexicographical order compress very well using prefix omission.
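To make the prefix-omission point concrete, here is a minimal sketch (not WebGraph's actual code) of front coding: each title in a sorted list is stored as the length of the prefix it shares with the previous title, plus the remaining suffix.

```python
# Front coding (prefix omission) for a lexicographically sorted title list.
# Illustrative sketch only; WebGraph's actual encoding differs in detail.

def front_code(titles):
    """Encode sorted titles as (shared-prefix length, suffix) pairs."""
    encoded, prev = [], ""
    for t in titles:
        # Length of the prefix shared with the previous title.
        lcp = 0
        while lcp < min(len(prev), len(t)) and prev[lcp] == t[lcp]:
            lcp += 1
        encoded.append((lcp, t[lcp:]))
        prev = t
    return encoded

def front_decode(encoded):
    """Recover the original title list."""
    titles, prev = [], ""
    for lcp, suffix in encoded:
        t = prev[:lcp] + suffix
        titles.append(t)
        prev = t
    return titles

titles = ["Category:Physics", "Category:Physics_stubs", "Category:Physiology"]
enc = front_code(titles)
# Only the suffixes "_stubs" and "ology" are stored for the later titles.
assert front_decode(enc) == titles
```

Because adjacent sorted titles share long prefixes (e.g. the "Category:" namespace), most of each string is elided.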
Moreover, our proposal is to distribute an embedded, easy-to-use graph. Load it into memory,
access the successor lists, and that's it. Setting up a service is a more complex
goal: it is more complex to use, and it gives slower access. It's like an SQL server
vs. an embedded BerkeleyDB database.
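The "embedded graph" idea above can be sketched like this: the whole graph sits in two in-process arrays (a CSR-style layout), and reading a node's successor list is just a slice, with no server round-trip. The class and method names here are illustrative, not WebGraph's actual API.

```python
# Minimal sketch of an embedded in-memory graph (CSR adjacency layout).
# Hypothetical names; not the WebGraph library's API.
from array import array

class EmbeddedGraph:
    def __init__(self, edges, num_nodes):
        # offsets[u] .. offsets[u+1] delimits u's successors in `succ`.
        counts = [0] * num_nodes
        for u, _ in edges:
            counts[u] += 1
        self.offsets = array("l", [0] * (num_nodes + 1))
        for u in range(num_nodes):
            self.offsets[u + 1] = self.offsets[u] + counts[u]
        self.succ = array("l", [0] * len(edges))
        pos = list(self.offsets[:num_nodes])
        for u, v in edges:
            self.succ[pos[u]] = v
            pos[u] += 1

    def successors(self, u):
        """Successor list of u: a plain array slice, no I/O."""
        return self.succ[self.offsets[u]:self.offsets[u + 1]]

g = EmbeddedGraph([(0, 1), (0, 2), (1, 2)], num_nodes=3)
assert list(g.successors(0)) == [1, 2]
```

The contrast with a service architecture is exactly the one in the email: here every lookup is a local memory access, whereas CatGraph's design pays for a client/server hop on each query in exchange for centralizing the RAM.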
Ciao,
seba