On Tue, Dec 17, 2013 at 4:57 PM, Sebastiano Vigna <vigna(a)di.unimi.it> wrote:
On 17 Dec 2013, at 6:01 AM, Johannes Kroll
<johannes.kroll(a)wikimedia.de> wrote:
Not sure how practical it would be to put this all into one graph. In
CatGraph we have one instance for the category/page structure of each
supported language. Each instance lives in its own process. They talk
to a server process which talks to clients. The largest single graph
has about 76 million arcs (enwiki category links). Currently there is
one server process on one host, but we could distribute that to several
hosts when we need to, e.g. when we need more RAM. This approach scales
better than one single in-memory graph for *everything*.
Well, we routinely handle graphs two orders of magnitude larger in RAM on a laptop. The
point is which graph representation you use: we developed compressed representations that
reduce the size of the graph in memory by an order of magnitude. If you have a look at
http://law.di.unimi.it/webdata/enwiki-2013/, you'll see that all of English Wikipedia (no
templates) is 159 MB. With all categories and category links it is 230 MB. Lists of titles in
lexicographical order compress very well using prefix omission.
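To illustrate what prefix omission means here, a minimal sketch in Python: each title in a sorted list is stored as the number of characters it shares with the previous title, plus the remaining suffix. The function names and the sample titles are made up for this example; this is not the actual encoding used for the files at the URL above.

```python
import os

def front_encode(sorted_titles):
    """Encode a sorted list as (shared_prefix_length, suffix) pairs."""
    encoded = []
    previous = ""
    for title in sorted_titles:
        # Number of leading characters shared with the previous title.
        shared = len(os.path.commonprefix([previous, title]))
        encoded.append((shared, title[shared:]))
        previous = title
    return encoded

def front_decode(encoded):
    """Rebuild the original titles from (shared_prefix_length, suffix) pairs."""
    titles = []
    previous = ""
    for shared, suffix in encoded:
        title = previous[:shared] + suffix
        titles.append(title)
        previous = title
    return titles

# Hypothetical sample data, sorted lexicographically.
titles = sorted(["Graph", "Graph theory", "Graph traversal", "Graphics"])
encoded = front_encode(titles)
```

Adjacent titles in a sorted list tend to share long prefixes, so most entries shrink to a small integer and a short suffix.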
Moreover, our proposal is to distribute an embedded, easy-to-use graph. Load it in memory
and access successor lists, and that's it. Setting up a service is a more complex
goal: it is harder to use and it gives slower access. It's like an SQL server
vs. an embedded BerkeleyDB database.
In my opinion, the first step would be to have the graphs (links, categories)
in a simple plain format: e.g., for each entity, represented by its
wikiId, the outgoing links or the categories (still represented by
wikiIds), separated by tabs and provided by Wikipedia... and then a
simple easy-to-use framework for indexing and navigating the graph.
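To make the idea concrete, a rough sketch in Python of what I have in mind. I am assuming one line per entity, with the source wikiId as the first tab-separated field and the target wikiIds after it; no such dump exists yet, and the function names and sample ids below are invented for the example.

```python
from collections import deque
import io

def load_graph(lines):
    """Build {wikiId: [successor wikiIds]} from tab-separated lines."""
    graph = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        node = int(fields[0])
        graph[node] = [int(f) for f in fields[1:] if f]
    return graph

def successors(graph, node):
    """Return the outgoing links of a node (empty if the node is unknown)."""
    return graph.get(node, [])

def reachable(graph, start, max_depth):
    """Breadth-first search: all nodes within max_depth hops of start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in successors(graph, node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen

# Hypothetical sample in the assumed format: "source<TAB>target<TAB>target..."
sample = io.StringIO("1\t2\t3\n2\t3\n3\t4\n4\n")
graph = load_graph(sample)
```

A plain dictionary like this will not scale to the full link graph, but it shows how little a consumer needs once the data is in this shape: a loader and a successors lookup.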
Cheers,
Diego
Ciao,
seba
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
--
Computers are useless. They can only give you answers.
(Pablo Picasso)
_______________
Diego Ceccarelli
High Performance Computing Laboratory
Information Science and Technologies Institute (ISTI)
Italian National Research Council (CNR)
Via Moruzzi, 1
56124 - Pisa - Italy
Phone: +39 050 315 2984
Fax: +39 050 315 2040