On Sun, 15 Dec 2013 08:51:33 -0800
Sebastiano Vigna <vigna(a)di.unimi.it> wrote:
On 15 Dec 2013, at 3:30 AM, Maarten Dammers
<maarten(a)mdammers.nl> wrote:
I would love to have some sort of dump or (even better) a central
service I can query. It should contain, for all Wikimedia projects:
* Page links (page A links to page B)
* Category links (page A is in category C)
* Image links (page A uses image I)
* Interlanguage links (page A in language en links to page A' in language nl)
* Interproject links (page A in the English Wikipedia links to page A' on Wikimedia
Commons)
And to make it really complete:
* Wikidata claims (item A has a claim pointing to item B)
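For concreteness: suppose each link type above were shipped as a flat
"source<TAB>target" file. The file names and the TSV layout below are
made up for illustration (the real per-wiki dumps ship these tables as
SQL files, and Wikidata claims would need a separate source), but a
client could then merge everything into one labeled edge list:

import csv
from collections import defaultdict

# Hypothetical flat dumps, one file per link type. The names mirror
# the MediaWiki link tables, but the TSV layout is an assumption.
LINK_TYPES = ["pagelinks", "categorylinks", "imagelinks",
              "langlinks", "iwlinks"]

def load_edges(basedir="."):
    """Return {source: [(target, link_type), ...]} over all link types."""
    graph = defaultdict(list)
    for link_type in LINK_TYPES:
        with open("%s/%s.tsv" % (basedir, link_type)) as f:
            for source, target in csv.reader(f, delimiter="\t"):
                graph[source].append((target, link_type))
    return graph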
Well, that would be a more interesting graph to build: all pages, all
languages, all images, all categories, all together. Probably in the
100M pages/5B arcs range. Having all languages together would make
inference/learning easier: one could work on the English part and then
propagate the results. At that scale, compression would be essential
to keep it as an in-core data structure.
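To put rough numbers on that: with 32-bit node IDs, 5B arcs come to
about 20 GB of raw successor lists, so some form of compression is
needed before the graph fits comfortably in memory on one machine.
Here is a minimal sketch of the usual trick (gap-encode each sorted
successor list, then write the gaps in a variable-byte code; WebGraph
uses considerably more refined codings than this, so take it only as
an illustration of the idea):

def vbyte_encode(n):
    """Encode a non-negative int, 7 bits per byte; high bit marks 'more'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # continuation bit
        else:
            out.append(byte)
            return bytes(out)

def encode_successors(successors):
    """Gap-encode a sorted successor list, then vbyte-code the gaps."""
    out = bytearray()
    prev = 0
    for node in sorted(successors):
        out += vbyte_encode(node - prev)   # small gaps cost one byte
        prev = node
    return bytes(out)

# Example: successors [1000000, 1000003, 1000017] cost 3 + 1 + 1 = 5
# bytes here, instead of 12 bytes as three raw 32-bit IDs.

Since links are mostly local (pages link within their own wiki and
neighborhood), most gaps are small, which is what makes a 5B-arc graph
plausible in core.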
Not sure how practical it would be to put this all into one graph. In
CatGraph we have one instance for the category/page structure of each
supported language. Each instance lives in its own process, and they
all talk to a server process, which in turn talks to clients. The
largest single graph
has about 76 million arcs (enwiki category links). Currently there is
one server process on one host, but we could distribute it across
several hosts when needed, e.g. for more RAM. This approach scales
better than one single in-memory graph for *everything*.
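To sketch the shape of that (this is not CatGraph's actual code or
protocol, just an illustration of the process-per-wiki idea in
Python): a front end keeps one pipe per wiki and forwards requests, so
each graph lives in its own address space and could later be moved to
another host behind the same interface.

import multiprocessing as mp
from collections import defaultdict

def graph_worker(wiki, conn):
    """Hold one wiki's graph in this worker's own memory."""
    graph = defaultdict(set)              # node -> set of successors
    while True:
        cmd, *args = conn.recv()
        if cmd == "add-arc":
            src, dst = args
            graph[src].add(dst)
            conn.send("ok")
        elif cmd == "successors":
            conn.send(sorted(graph[args[0]]))
        elif cmd == "shutdown":
            conn.send("bye")
            return

class GraphServer:
    """Front end: routes each request to the right wiki's worker."""
    def __init__(self):
        self.workers = {}                 # wiki -> (pipe, process)

    def _conn(self, wiki):
        if wiki not in self.workers:      # start workers on demand
            parent_end, child_end = mp.Pipe()
            proc = mp.Process(target=graph_worker,
                              args=(wiki, child_end))
            proc.start()
            self.workers[wiki] = (parent_end, proc)
        return self.workers[wiki][0]

    def request(self, wiki, *message):
        conn = self._conn(wiki)
        conn.send(message)
        return conn.recv()

if __name__ == "__main__":
    server = GraphServer()
    server.request("enwiki", "add-arc",
                   "Category:Physics", "Category:Mechanics")
    print(server.request("enwiki", "successors", "Category:Physics"))
    for conn, proc in server.workers.values():
        conn.send(("shutdown",))
        conn.recv()
        proc.join()

The front end only routes, so memory grows per wiki rather than in one
monolithic graph, and "distribute to several hosts" reduces to moving
a worker behind a socket instead of a pipe.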