(cross-posting Sebastiano’s post from the analytics list; this may be of interest to both the wikidata and wiki-research-l communities)
Begin forwarded message:
From: Sebastiano Vigna <vigna@di.unimi.it>
Subject: [Analytics] Distributing an official graph
Date: December 9, 2013 at 10:09:31 PM PST
[Reposted from private discussion after Dario's request]
My problem is that of exploring the graph structure of Wikipedia
- easily;
- reproducibly;
- in a way that does not depend on parsing artifacts.
Presently, when people want to do this they either do their own parsing of the dumps, use the SQL data, or download a dataset like
http://law.di.unimi.it/webdata/enwiki-2013/
which has everything "cooked up".
My frustration in the last few days came when trying to add the category links. I didn't realize (well, it's not very documented) that bliki extracts all links and renders them in HTML *except* for the category links, which are instead accessible programmatically. Once I got there, I was able to make some progress.
Nonetheless, I think that the graph of Wikipedia connections (hyperlinks and category links) is really a mine of information, and it is a pity that so much huffing and puffing is necessary to do something as simple as a reverse visit of the category links from "People" to get all people pages (in practice this is a bit more complicated--there are many false positives--but after a couple of fixes it worked quite well).
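For concreteness, once the category-link graph is available as a WebGraph ImmutableGraph, a reverse visit of that kind is only a few lines. The sketch below is a minimal illustration, not Sebastiano's actual pipeline; the graph object and the node id of the "People" category are assumed to be available already.

import it.unimi.dsi.webgraph.ImmutableGraph;
import it.unimi.dsi.webgraph.LazyIntIterator;
import it.unimi.dsi.webgraph.Transform;

import java.util.ArrayDeque;
import java.util.BitSet;

public class ReverseCategoryVisit {
    /** Marks every node reachable from peopleCategory by following category links backwards. */
    static BitSet reverseVisit(ImmutableGraph categoryGraph, int peopleCategory) {
        // Transpose the graph so arcs run from a category to the pages/subcategories it contains.
        final ImmutableGraph reversed = Transform.transpose(categoryGraph);
        final BitSet seen = new BitSet(reversed.numNodes());
        final ArrayDeque<Integer> queue = new ArrayDeque<>();
        seen.set(peopleCategory);
        queue.add(peopleCategory);
        while (!queue.isEmpty()) {
            final int node = queue.remove();
            final LazyIntIterator successors = reversed.successors(node);
            for (int s; (s = successors.nextInt()) != -1; ) {
                if (!seen.get(s)) { // breadth-first visit; each node is enqueued once
                    seen.set(s);
                    queue.add(s);
                }
            }
        }
        return seen; // set bits = candidate "people" pages (false positives still need filtering)
    }
}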
Moreover, one continuously has this feeling of walking on eggshells: a small change in bliki, a small change in the XML format, and everything might stop working in such a subtle manner that you realize it only after a long time.
I was wondering if Wikimedia would be interested in distributing in compressed form the Wikipedia graph. That would be the "official" Wikipedia graph--the benefits, in particular for people working on leveraging semantic information from Wikipedia, would be really significant.
I would (obviously) propose to use our Java framework, WebGraph, which is by now quite standard for distributing large (indeed, much larger) graphs, such as ClueWeb09 http://lemurproject.org/clueweb09/, ClueWeb12 http://lemurproject.org/clueweb12/ and the recent Common Crawl hyperlink graph http://webdatacommons.org/hyperlinkgraph/index.html. But any format is OK, even a pair of integers per line. The advantage of a binary compressed form is reduced network utilization, instantaneous availability of the information, etc.
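To make the "instantaneous availability" point concrete, this is roughly what a consumer of such a compressed graph would do with WebGraph; it is only a sketch, and "enwiki-links" is a hypothetical basename for the distributed .graph/.offsets/.properties files, not an agreed-upon name.

import it.unimi.dsi.webgraph.BVGraph;
import it.unimi.dsi.webgraph.ImmutableGraph;
import it.unimi.dsi.webgraph.LazyIntIterator;

public class ReadGraph {
    public static void main(String[] args) throws Exception {
        // Load the compressed graph from the hypothetical basename "enwiki-links".
        final ImmutableGraph graph = BVGraph.load("enwiki-links");
        System.out.println("nodes: " + graph.numNodes() + ", arcs: " + graph.numArcs());

        // Enumerate the out-links of node 0 (the page with the smallest id).
        final LazyIntIterator successors = graph.successors(0);
        for (int s; (s = successors.nextInt()) != -1; )
            System.out.println("0 -> " + s);
    }
}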
Probably it would be useful to actually distribute several graphs with the same dataset--e.g., the category links, the content links, etc. With WebGraph it is immediate to build a union (i.e., a superposition) of any set of such graphs and use it transparently as a single graph.
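As an illustration of the superposition idea (again with hypothetical basenames):

import it.unimi.dsi.webgraph.BVGraph;
import it.unimi.dsi.webgraph.ImmutableGraph;
import it.unimi.dsi.webgraph.Transform;

public class UnionExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical basenames for two separately distributed graphs.
        final ImmutableGraph contentLinks = BVGraph.load("enwiki-content-links");
        final ImmutableGraph categoryLinks = BVGraph.load("enwiki-category-links");

        // Superpose them: the union has an arc x -> y iff either graph has it,
        // and can be used transparently wherever an ImmutableGraph is expected.
        final ImmutableGraph both = Transform.union(contentLinks, categoryLinks);
        System.out.println("union nodes: " + both.numNodes());
    }
}

This of course only works if the two graphs share the same node id space, which is exactly why a common, contiguous id assignment (next paragraph) matters.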
In my mind the distributed graph should have a contiguous ID space, say, induced by the lexicographical order of the titles (possibly placing template pages at the start or at the end of the ID space). We would provide the graphs and a bidirectional node<->title map. All such information would use about 300 MB of space for the current English Wikipedia. People could then associate pages to nodes using the title as a key.
But this last part is just rambling. :)
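To illustrate the node<->title mapping, here is a bare-bones sketch, assuming a hypothetical enwiki-titles.txt file with one title per line, in the same lexicographical order used to assign the ids:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TitleMap {
    final List<String> nodeToTitle = new ArrayList<>();       // node id -> title
    final Map<String, Integer> titleToNode = new HashMap<>(); // title -> node id

    TitleMap(String titlesFile) throws Exception {
        // Line i of the file holds the title of node i.
        try (BufferedReader in = new BufferedReader(new FileReader(titlesFile))) {
            for (String title; (title = in.readLine()) != null; ) {
                titleToNode.put(title, nodeToTitle.size());
                nodeToTitle.add(title);
            }
        }
    }

    int node(String title) { return titleToNode.get(title); }
    String title(int node) { return nodeToTitle.get(node); }
}

In practice a more compact (e.g., prefix-compressed) representation would replace the in-memory list and hash map, which is how the whole package can stay within the ~300 MB figure above.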
Let me know if you people are interested. We can of course take care of the process of cooking up the information once it is out of the SQL database.
Ciao,
seba
Hi,
I think this is definitely a great idea which will save lots of researchers a ton of work.
Cheers,
+1
Same here. Also, using a standardized dataset would make it much easier to reproduce others' work.
G
-- Giovanni Luca Ciampaglia
Postdoctoral fellow Center for Complex Networks and Systems Research Indiana University
✎ 910 E 10th St ∙ Bloomington ∙ IN 47408 ☞ http://cnets.indiana.edu/ ✉ gciampag@indiana.edu ✆ 1-812-855-7261
While we (Research & Data @ WMF) consider maintaining a standard dump of categories and pagelinks, anyone can pull such a dataset from tool labs slaves (see http://tools.wmflabs.org) with the following two queries.
/* Get all page links */
SELECT origin.page_id        AS from_id,
       origin.page_namespace AS from_namespace,
       origin.page_title     AS from_title,
       dest.page_id          AS to_id,   /* NULL if the target page doesn't exist */
       pl_namespace          AS to_namespace,
       pl_title              AS to_title
FROM pagelinks
LEFT JOIN page origin ON origin.page_id = pl_from
LEFT JOIN page dest   ON dest.page_namespace = pl_namespace
                     AND dest.page_title = pl_title;
/* Get all category links */
SELECT origin.page_id        AS from_id,
       origin.page_namespace AS from_namespace,
       origin.page_title     AS from_title,
       cl_to                 AS category_title
FROM categorylinks
LEFT JOIN page origin ON page_id = cl_from;
Note that these tables are very large. For English Wikipedia, pagelinks contains ~900 million rows and categorylinks contains ~66 million rows.
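For anyone who wants to turn the first query's output directly into the "pair of integers per line" format mentioned earlier in the thread, a streaming JDBC pass along these lines should work. This is only a sketch: the host, database name and credentials are placeholders for whatever your Tool Labs account provides, it assumes the MySQL Connector/J driver is on the classpath, and the fetch-size trick is that driver's way of avoiding buffering the ~900 million rows in memory.

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DumpPageLinks {
    public static void main(String[] args) throws Exception {
        // Placeholder connection parameters; substitute your replica host and credentials.
        final String url = "jdbc:mysql://enwiki.labsdb:3306/enwiki_p";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement st = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
             BufferedWriter out = new BufferedWriter(new FileWriter("pagelinks-edges.tsv"))) {
            st.setFetchSize(Integer.MIN_VALUE); // stream rows one by one (MySQL Connector/J)
            try (ResultSet rs = st.executeQuery(
                    "SELECT origin.page_id, dest.page_id FROM pagelinks " +
                    "LEFT JOIN page origin ON origin.page_id = pl_from " +
                    "LEFT JOIN page dest ON dest.page_namespace = pl_namespace " +
                    "AND dest.page_title = pl_title")) {
                while (rs.next()) {
                    final long from = rs.getLong(1);
                    final long to = rs.getLong(2);
                    if (rs.wasNull()) continue; // target page does not exist (red link); skip the arc
                    out.write(from + "\t" + to + "\n");
                }
            }
        }
    }
}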
-Aaron
Hi,
the Wikipedia category graph as well as the page link graph is also available for various languages from the DBpedia download page:
http://dbpedia.org/Downloads39
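Those dumps are N-Triples files, so if all one needs is an edge list, a line-oriented pass is enough. A rough sketch follows; "page_links_en.nt" is assumed to be the unpacked English page-links dump (check the exact file name on the download page), and every object in that file is assumed to be a resource URI rather than a literal.

import java.io.BufferedReader;
import java.io.FileReader;

public class DBpediaPageLinks {
    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new FileReader("page_links_en.nt"))) {
            for (String line; (line = in.readLine()) != null; ) {
                if (line.isEmpty() || line.charAt(0) == '#') continue; // skip comments
                // N-Triples: <subject> <predicate> <object> .
                final String[] parts = line.split(" ");
                if (parts.length < 3) continue;
                // Strip the URI prefix and the trailing '>' to recover the page titles.
                final String from = parts[0].substring(parts[0].lastIndexOf('/') + 1, parts[0].length() - 1);
                final String to = parts[2].substring(parts[2].lastIndexOf('/') + 1, parts[2].length() - 1);
                System.out.println(from + "\t" + to); // one arc per line, keyed by title
            }
        }
    }
}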
Cheers,
Chris