[Reposted from private discussion after Dario's request]
My problem is that of exploring the graph structure of Wikipedia
1) easily;
2) reproducibly;
3) in a way that does not depend on parsing artifacts.
At present, when people want to do this they either do their own parsing of the dumps,
use the SQL data, or download a dataset like
http://law.di.unimi.it/webdata/enwiki-2013/
which has everything "cooked up".
My frustration in the last few days came from trying to add the category links. I
didn't realize (well, it's not well documented) that bliki extracts all links and
renders them in HTML *except* for the category links, which are instead accessible
programmatically. Once I got there, I was able to make some progress.
Nonetheless, I think that the graph of Wikipedia connections (hyperlinks and category
links) is really a mine of information, and it is a pity that so much huffing and puffing
is necessary to do something as simple as a reverse visit of the category links from
"People" to get, essentially, all people pages (this is a bit more
complicated--there are many false positives--but after a couple of fixes it worked quite
well).
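
Just to give an idea, this is more or less what such a reverse visit would look like with
WebGraph (the basename "enwiki-cat" and the node id of "Category:People" are made up, of
course--they depend on how the graph would actually be distributed):

import it.unimi.dsi.fastutil.ints.IntArrayFIFOQueue;
import it.unimi.dsi.fastutil.ints.IntOpenHashSet;
import it.unimi.dsi.webgraph.ImmutableGraph;
import it.unimi.dsi.webgraph.LazyIntIterator;
import it.unimi.dsi.webgraph.Transform;

public class PeopleVisit {
    public static void main(String[] args) throws Exception {
        // Hypothetical basename of the category-link graph (arcs go page -> category).
        final ImmutableGraph cat = ImmutableGraph.load("enwiki-cat");
        // Transpose it, so we can walk from a category down to its members.
        final ImmutableGraph catT = Transform.transpose(cat);

        final int peopleNode = 42; // hypothetical node id of "Category:People"
        final IntOpenHashSet seen = new IntOpenHashSet();
        final IntArrayFIFOQueue queue = new IntArrayFIFOQueue();
        seen.add(peopleNode);
        queue.enqueue(peopleNode);

        // Plain breadth-first visit over the transposed category links.
        while (!queue.isEmpty()) {
            final LazyIntIterator successors = catT.successors(queue.dequeueInt());
            for (int s; (s = successors.nextInt()) != -1;)
                if (seen.add(s)) queue.enqueue(s);
        }
        System.out.println(seen.size() + " pages/categories reachable from Category:People");
    }
}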
Moreover, one continuously has the feeling of walking on eggshells: a small change in
bliki, a small change in the XML format, and everything might stop working in such a subtle
manner that you realize it only after a long time.
I was wondering if Wikimedia would be interested in distributing in compressed form the
Wikipedia graph. That would be the "official" Wikipedia graph--the benefits, in
particular for people working on leveraging semantic information from Wikipedia, would be
really significant.
I would (obviously) propose to use our Java framework, WebGraph, which is actually quite
standard for distributing large (well, actually much larger) graphs, such as ClueWeb09
http://lemurproject.org/clueweb09/, ClueWeb12
http://lemurproject.org/clueweb12/ and the
recent Common Crawl hyperlink graph
http://webdatacommons.org/hyperlinkgraph/index.html. But any
format is OK, even a pair of integers per line. The advantage of a binary compressed form
is reduced network utilization, instantaneous availability of the information, etc.
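
Just to show what "instantaneous availability" means in practice, this is how one would
load and poke at a graph in BVGraph format (the basename is hypothetical):

import java.util.Arrays;

import it.unimi.dsi.webgraph.BVGraph;
import it.unimi.dsi.webgraph.ImmutableGraph;

public class ScanGraph {
    public static void main(String[] args) throws Exception {
        // Hypothetical basename: enwiki-links.graph/.offsets/.properties on disk.
        final ImmutableGraph g = BVGraph.load("enwiki-links");
        System.out.println(g.numNodes() + " nodes, " + g.numArcs() + " arcs");
        // Successors are decoded on the fly from the compressed representation.
        final int node = 12345; // hypothetical node id
        System.out.println("Successors of " + node + ": "
            + Arrays.toString(Arrays.copyOf(g.successorArray(node), g.outdegree(node))));
    }
}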
Probably it would be useful to distribute several graphs for the same
dataset--e.g., the category links, the content links, etc. With WebGraph, it is immediate
to build a union (i.e., a superposition) of any set of such graphs and use it
transparently as a single graph.
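
For instance, assuming the content-link and category-link graphs were distributed under
basenames like "enwiki-links" and "enwiki-cat" (hypothetical names), the union is lazy and
costs essentially nothing:

import it.unimi.dsi.webgraph.ImmutableGraph;
import it.unimi.dsi.webgraph.Transform;

public class UnionGraphs {
    public static void main(String[] args) throws Exception {
        // Hypothetical basenames of the content-link and category-link graphs.
        final ImmutableGraph links = ImmutableGraph.load("enwiki-links");
        final ImmutableGraph cats = ImmutableGraph.load("enwiki-cat");
        // Lazy union: successor lists are merged on the fly, nothing is materialized.
        final ImmutableGraph union = Transform.union(links, cats);
        System.out.println("Union: " + union.numNodes() + " nodes");
    }
}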
In my mind the distributed graph should have a contiguous ID space, say, induced by the
lexicographical order of the titles (possibly placing template pages at the start or at
the end of the ID space). We should provide the graphs and a bidirectional node<->title
map. All such information would take about 300 MB for the current English Wikipedia.
People could then associate pages to nodes using the title as a key.
But this last part is just rambling. :)
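
Still, to make it concrete: with IDs induced by the lexicographical order of the titles,
the node<->title map can be as simple as the sorted list of titles--an array lookup in one
direction, a binary search in the other (in a real distribution one would probably use
something more compact, like a front-coded list). The file name here is hypothetical:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class TitleMap {
    public static void main(String[] args) throws Exception {
        // Hypothetical file: one title per line, in the same lexicographical
        // order used to assign node ids.
        final String[] titles = Files.readAllLines(
            Paths.get("enwiki.titles"), StandardCharsets.UTF_8).toArray(new String[0]);

        // node -> title: an array lookup.
        final int node = 12345; // hypothetical node id
        System.out.println("Title of node " + node + ": " + titles[node]);

        // title -> node: binary search (assuming titles are in String's natural order).
        System.out.println("Node of \"Alan Turing\": " + Arrays.binarySearch(titles, "Alan Turing"));
    }
}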
Let me know if you people are interested. We can of course take care of the process of
cooking up the information once it is out of the SQL database.
Ciao,
seba