This request seems like it could be easy to fulfill. Am I understanding
correctly that the dataset being sought would simply contain a list of
pairs of pages (in the case of internal links) and a list of page/category
pairs (in the case of categorization)?
We can simply dump out the categorylinks and pagelinks tables in order
to meet these needs. I understand that the SQL dump format is painful to
deal with, but a plain CSV/TSV format should be fine. The mysql client
will produce a solid TSV format if you just pipe a query to the client and
stream the output to a file (or, better yet, through bzip2 and then to a file).
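For instance, something along these lines would produce the same kind of TSV
via JDBC instead of the command-line client (only a rough sketch: the
connection URL, credentials, output file name, and the pagelinks column names
pl_from/pl_namespace/pl_title are placeholders/assumptions, not a verified
schema):

    import java.io.PrintWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DumpPagelinksTsv {
        public static void main(String[] args) throws Exception {
            // Placeholder URL/credentials; pl_from, pl_namespace and pl_title
            // are assumptions about the pagelinks schema.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/enwiki", "user", "pass");
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT pl_from, pl_namespace, pl_title FROM pagelinks");
                 PrintWriter out = new PrintWriter("pagelinks.tsv")) {
                while (rs.next()) {
                    // One tab-separated row per link:
                    // source page ID, target namespace, target title.
                    out.println(rs.getLong(1) + "\t" + rs.getInt(2) + "\t" + rs.getString(3));
                }
            }
        }
    }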
These two tables track the most recent categorization/link structure of
pages, so we wouldn't be able to use them historically.
-Aaron
On Tue, Dec 10, 2013 at 12:09 AM, Sebastiano Vigna <vigna(a)di.unimi.it> wrote:
[Reposted from private discussion after Dario's request]
My problem is that of exploring the graph structure of Wikipedia:
1) easily;
2) reproducibly;
3) in a way that does not depend on parsing artifacts.
Presently, when people want to do this they either do their own parsing
of the dumps, or they use the SQL data, or they download a dataset like
http://law.di.unimi.it/webdata/enwiki-2013/
which has everything "cooked up".
My frustration in the last few days came when trying to add the category
links. I didn't realize (well, it's not well documented) that bliki
extracts all links and renders them in HTML *except* for the category links,
which are instead accessible programmatically. Once I got there, I was able
to make some progress.
Nonetheless, I think that the graph of Wikipedia connections (hyperlinks
and category links) is really a mine of information, and it is a pity that a
lot of huffing and puffing is necessary to do something as simple as a
reverse visit of the category links from "People" to get, essentially, all
people pages (this is a bit more complicated--there are many false
positives--but after a couple of fixes it worked quite well).
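Just to make what I mean by "reverse visit" concrete, this is roughly how it
would look with WebGraph (only a sketch: the basename "enwiki-categorylinks"
is hypothetical, arcs are assumed to go from page to category, and the node
ID of "People" is assumed to be already resolved through the title map):

    import it.unimi.dsi.webgraph.ImmutableGraph;
    import it.unimi.dsi.webgraph.Transform;
    import java.util.ArrayDeque;

    public class ReverseCategoryVisit {
        public static void main(String[] args) throws Exception {
            // Hypothetical basename; arcs assumed to go page -> category.
            ImmutableGraph categoryLinks = ImmutableGraph.load("enwiki-categorylinks");
            ImmutableGraph reverse = Transform.transpose(categoryLinks);
            int people = Integer.parseInt(args[0]); // node ID of "People", via the title map

            // Breadth-first visit of the transposed graph: everything reached
            // links (transitively) to "People" through category links.
            boolean[] seen = new boolean[reverse.numNodes()];
            ArrayDeque<Integer> queue = new ArrayDeque<>();
            seen[people] = true;
            queue.add(people);
            while (!queue.isEmpty()) {
                int node = queue.remove();
                int d = reverse.outdegree(node);
                int[] succ = reverse.successorArray(node);
                for (int i = 0; i < d; i++)
                    if (!seen[succ[i]]) { seen[succ[i]] = true; queue.add(succ[i]); }
            }
            // seen[] now marks all pages/subcategories under "People"
            // (false positives included).
        }
    }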
Moreover, one continuously has this feeling of walking on eggshells: a
small change in bliki, a small change in the XML format, and everything
might stop working in such a subtle manner that you realize it only after a
long time.
I was wondering if Wikimedia would be interested in distributing in
compressed form the Wikipedia graph. That would be the "official" Wikipedia
graph--the benefits, in particular for people working on leveraging
semantic information from Wikipedia, would be really significant.
I would (obviously) propose to use our Java framework, WebGraph, which is
by now quite standard for distributing large (well, actually much larger)
graphs, such as ClueWeb09
http://lemurproject.org/clueweb09/, ClueWeb12
http://lemurproject.org/clueweb12/, and the recent Common Crawl hyperlink graph
http://webdatacommons.org/hyperlinkgraph/index.html. But any format is
OK, even a pair of integers per line. The advantage of a binary compressed
form is reduced network utilization, instantaneous availability of the
information, etc.
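To give an idea of how little machinery is involved, this is more or less how
an arc list (one pair of integers per line) gets turned into a compressed
BVGraph (again a sketch under my own assumptions: the input file name, and
contiguous node IDs with no duplicate arcs):

    import it.unimi.dsi.webgraph.ArrayListMutableGraph;
    import it.unimi.dsi.webgraph.BVGraph;
    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class CompressArcList {
        public static void main(String[] args) throws Exception {
            // Hypothetical input "wikigraph.arcs": one "source target" pair per line,
            // node IDs already contiguous in [0, numNodes), no duplicate arcs.
            int numNodes = Integer.parseInt(args[0]);
            ArrayListMutableGraph g = new ArrayListMutableGraph();
            g.addNodes(numNodes);
            try (BufferedReader in = Files.newBufferedReader(Paths.get("wikigraph.arcs"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.trim().split("\\s+");
                    g.addArc(Integer.parseInt(f[0]), Integer.parseInt(f[1]));
                }
            }
            // Write the compressed graph under the basename "wikigraph"
            // (.graph, .offsets and .properties files).
            BVGraph.store(g.immutableView(), "wikigraph");
        }
    }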
Probably it would be useful to actually distribute several graphs with the
same dataset--e.g., the category links, the content links, etc. It is
immediate, using WebGraph, to build a union (i.e., a superposition) of any
set of such graphs and use it transparently as a single graph.
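For instance, superposing the content-link and category-link graphs is
essentially a one-liner (the basenames below are hypothetical, and the two
graphs are assumed to share the same node ID space):

    import it.unimi.dsi.webgraph.ImmutableGraph;
    import it.unimi.dsi.webgraph.Transform;

    public class UnionExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical basenames; both graphs must use the same node IDs.
            ImmutableGraph contentLinks = ImmutableGraph.load("enwiki-pagelinks");
            ImmutableGraph categoryLinks = ImmutableGraph.load("enwiki-categorylinks");
            // The union behaves transparently as a single ImmutableGraph.
            ImmutableGraph both = Transform.union(contentLinks, categoryLinks);
            System.out.println(both.numNodes() + " nodes in the combined view");
        }
    }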
In my mind the distributed graph should have a contiguous ID space, say,
induced by the lexicographical order of the titles (possibly placing
template pages at the start or at the end of the ID space). We should
provide graphs, and a bidirectional node<->title map. All such information
would take about 300 MB of space for the current English Wikipedia. People
could then associate pages to nodes using the title as a key.
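In its simplest in-memory form (just to fix ideas; an actual distribution
would use a more compact structure), the bidirectional map is nothing more
than the lexicographically sorted title list itself:

    import java.util.Arrays;

    public class TitleMap {
        public static void main(String[] args) {
            // Toy title list; the node ID is the rank of the title
            // in lexicographical order.
            String[] titles = { "Alan_Turing", "Category:People", "Graph_theory" };
            Arrays.sort(titles);

            int node = Arrays.binarySearch(titles, "Category:People"); // title -> node
            String title = titles[node];                               // node  -> title
            System.out.println(node + " <-> " + title);
        }
    }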
But this last part is just rambling. :)
Let me know if you people are interested. We can of course take care of
the process of cooking up the information once it is out of the SQL
database.
Ciao,
seba