Distributing an official graph


Sebastiano Vigna

10 Dec 2013

6:09 a.m.

[Reposted from private discussion after Dario's request]

My problem is that of exploring the graph structure of Wikipedia 1) easily; 2) reproducibly; 3) in a way that does not depend on parsing artifacts. Presently, when people want to do this they either do their own parsing of the dumps, or they use the SQL data, or they download a dataset like http://law.di.unimi.it/webdata/enwiki-2013/ which has everything "cooked up".

My frustration in the last few days was when trying to add the category links. I didn't realize (well, it's not very documented) that bliki extracts all links and renders them in HTML *except* for the category links, which are instead accessible programmatically. Once I got there, I was able to make some progress.

Nonetheless, I think that the graph of Wikipedia connections (hyperlinks and category links) is really a mine of information and it is a pity that a lot of huffing and puffing is necessary to do something as simple as a reverse visit of the category links from "People" to get, actually, all people pages (this is a bit more complicated--there are many false positives, but after a couple of fixes it worked quite well). Moreover, one continuously has this feeling of walking on eggshells: a small change in bliki, a small change in the XML format and everything might stop working in such a subtle manner that you realize it only after a long time.

I was wondering if Wikimedia would be interested in distributing the Wikipedia graph in compressed form. That would be the "official" Wikipedia graph--the benefits, in particular for people working on leveraging semantic information from Wikipedia, would be really significant.

I would (obviously) propose to use our Java framework, WebGraph, which is actually quite standard in distributing large (well, actually much larger) graphs, such as ClueWeb09 http://lemurproject.org/clueweb09/, ClueWeb12 http://lemurproject.org/clueweb12/ and the recent Common Web Crawl http://webdatacommons.org/hyperlinkgraph/index.html. But any format is OK, even a pair of integers per line. The advantage of a binary compressed form is reduced network utilization, instantaneous availability of the information, etc.

Probably it would be useful to actually distribute several graphs with the same dataset--e.g., the category links, the content links, etc. It is immediate, using WebGraph, to build a union (i.e., a superposition) of any set of such graphs and use it transparently as a single graph.

In my mind the distributed graph should have a contiguous ID space, say, induced by the lexicographical order of the titles (possibly placing template pages at the start or at the end of the ID space). We should provide graphs, and a bidirectional node<->title map. All such information would use about 300M of space for the current English Wikipedia. People could then associate pages to nodes using the title as a key.

But this last part is just rambling. :) Let me know if you people are interested. We can of course take care of the process of cooking up the information once it is out of the SQL database.

Ciao, seba
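As an illustration of the "reverse visit" mentioned in this message, here is a minimal Python sketch (not the WebGraph-based code the author used). It assumes the category links are available as (member, category) pairs, where a member is either a page or a subcategory; the function name and the sample titles are hypothetical. It does not attempt the false-positive fixes the message alludes to.

```python
from collections import defaultdict, deque

def reverse_visit(category_links, root):
    """Return every page or category reachable from `root` by following
    category links backwards (from a category to its members)."""
    # Index the links by category so that members can be listed quickly.
    members = defaultdict(list)
    for member, category in category_links:
        members[category].append(member)

    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for member in members.get(node, ()):
            if member not in seen:  # category graphs can contain cycles
                seen.add(member)
                queue.append(member)
    seen.discard(root)
    return seen

# Hypothetical sample data: (member, category) pairs.
links = [
    ("Category:Physicists", "Category:People"),
    ("Albert Einstein", "Category:Physicists"),
    ("Rome", "Category:Cities"),
]
people = reverse_visit(links, "Category:People")
```

The visit is breadth-first and marks visited nodes, so it terminates even if the category structure contains loops.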


Aaron Halfaker

10 Dec

3:46 p.m.

This request seems like it could be easy to fulfill. Am I understanding correctly that the dataset being sought would simply contain a list of pairs of pages (in the case of internal links) and a list of page/category pairs (in the case of categorization)? We can simply dump out the categorylinks and pagelinks tables in order to meet these needs. I understand that the SQL dump format is painful to deal with, but a plain CSV/TSV format should be fine. The mysql client will create a solid TSV format if you just pipe a query to the client and stream that out to a file (or better yet, bzip and then a file). These two tables track the most recent categorization/link structure of pages, so we wouldn't be able to use them historically. -Aaron
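For readers wondering what consuming such a dump might look like, here is a hedged Python sketch. It assumes a bzip2-compressed TSV file with one link per line, the header row suppressed, and two columns: a numeric source page id followed by the target title. That column layout and the function names are assumptions made for illustration, not a format anyone in the thread agreed on.

```python
import bz2

def parse_pairs(lines):
    """Yield (source_page_id, target_title) from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so the title is kept whole.
        source, target = line.split("\t", 1)
        yield int(source), target

def read_pairs(path):
    """Stream pairs from a bzip2-compressed TSV file (hypothetical path)."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        yield from parse_pairs(handle)

# Hypothetical sample rows, as they might appear in the file.
sample = ["12\tPhysicists\n", "\n", "34\tCities_in_Italy\n"]
pairs = list(parse_pairs(sample))
```

Streaming line by line keeps memory use flat, which matters for tables with tens of millions of rows.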

...

Dan Andreescu

7:28 p.m.

On Tue, Dec 10, 2013 at 10:46 AM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:

...

Agreed, dumping the data seems straightforward. You could write a script using mysqldump on the labs db instances. It could update the graph definition daily for example.

...

These two tables track the most recent categorization/link structure of pages, so we wouldn't be able to use them historically.

I think as soon as we dump out the graph, we are adding a time dimension to it. Any analysis using the graph would have to take this into account. Analyses that need to look at things over time would then run into the problem of accessing historical versions of the graph. So it seems like the smart thing to do for now is just ignore that until it comes up :)

Sebastiano Vigna

11:38 p.m.

On 10 Dec 2013, at 9:46 AM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:

...

Yes, but one thing that must be done is normalize the id space. Presently category and pages have overlapping id spaces. They're also non-contiguous, which is a pain for running, say, any ranking algorithm.

...

We can simply dump out the categorylinks and pagelinks tables in order to meet these needs. I understand that the SQL dump format is painful to deal with, but a plain CSV/TSV format should be fine. The mysql client will create a solid TSV format if you just pipe a query to the client and stream that out to a file (or better yet, bzip and then a file).

We would need to 1) decide on an id space organization; 2) dump the data, translating it into the id space; 3) build the graph. For 1) I think it would be good to have, like, spaces [0..x) [x..y), one for categories and one for pages. For 2) it's just a matter of a Java class fiddling with the ids and the titles (the format for links is asymmetrical). For 3) I'd love to release a binary compressed version because it takes much less space, it is immediately usable, and if you want to dump the pairs <x,y> in ASCII it is just a single command line.

...

These two tables track the most recent categorization/link structure of pages, so we wouldn't be able to use them historically.

Someone previously asked for temporal data. How can we get access to that? We might provide a label file with on-off dates for every, say, category link. Ciao, seba

Aaron Halfaker

11 Dec

3:46 a.m.

I'm not sure what you are referring to when you say "id space". A page can be identified by its page_id or the pair: (page_namespace, page_title). A category can be identified by its title. You could enumerate them how you like post-hoc.

...

Someone previously asked for temporal data. How can we get access to that?

It doesn't exist. We could start recording a history now, but without a clear use-case, I'm not sure it's worth the time. -Aaron

Sebastiano Vigna

4:09 a.m.

On 10 Dec 2013, at 9:46 PM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:

...

The set of page ids in the SQL dump is not contiguous. And the ids of categories overlap with the ids of pages.

The best thing, from a computational perspective, is that if n is the number of pages plus the number of category pages, every page or category page is assigned a node number in the interval [0..n). To make things easier (in particular, compression of strings) it might be useful to assign these node numbers by listing titles lexicographically, maybe first enumerating category titles and then enumerating page titles.

My idea would be distributing a graph in binary compressed format with a compact id space [0..n), together with a bidirectional map title <-> node numbers. The map would be the bridge between the graph and the SQL dumps/Wikipedia text.
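The numbering scheme proposed in this message can be sketched in a few lines of Python. This is only an illustration under stated assumptions: titles carry their namespace prefix (so a category and a page never share the same string), and the function names are hypothetical. A real distribution would use a compressed map rather than a Python dict.

```python
def build_node_map(category_titles, page_titles):
    """Assign node numbers in [0..n): categories first, then pages,
    each group in lexicographical order of title."""
    titles = sorted(set(category_titles)) + sorted(set(page_titles))
    node_of = {title: node for node, title in enumerate(titles)}
    # Assumption: no title appears both as a category and as a page.
    assert len(node_of) == len(titles), "category/page title collision"
    return titles, node_of  # node -> title list, title -> node dict

def remap_arcs(title_arcs, node_of):
    """Translate (source_title, target_title) arcs into node-number arcs."""
    return [(node_of[s], node_of[t]) for s, t in title_arcs]

# Hypothetical sample data.
titles, node_of = build_node_map(
    ["Category:People", "Category:Cities"],
    ["Rome", "Albert Einstein"],
)
arcs = remap_arcs([("Rome", "Category:Cities")], node_of)
```

The list gives the node-to-title direction and the dict gives the title-to-node direction, so together they play the role of the bidirectional map.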

...

Someone previously asked for temporal data. How can we get access to that?

It doesn't exist. We could start recording a history now, but without a clear use-case, I'm not sure it's worth the time.

I thought from previous comments that it was available--my mistake! Ciao, seba

Dario Taraborelli

4:31 a.m.

the pagelinks table doesn’t record the link creation history but categorylinks does include a timestamp of the most recent change for each individual record [1]. Dario

[1] https://www.mediawiki.org/wiki/Manual:Categorylinks_table

Aaron Halfaker

4:39 a.m.

...

And the ids of categories overlap with the ids of pages.

Categories don't actually have ids. Categories are more like tags in that they exist as soon as a page is "linked" to one. Many categories have corresponding pages in the "Category" namespace that describe them, but a category "exists" before a page is created.

...

The best thing, from a computational perspective, is that if n is the number of pages plus the number of category pages every page or category page is assigned a node number in the interval [0..n).

Surely you can build a hash map on whatever unique identifier you like and get constant (amortized) lookup speed. I think it is best if we provide you with a raw format that will work and you do your own post processing to obtain the "id space" that you like. -Aaron

Sebastiano Vigna

4:48 a.m.

On 10 Dec 2013, at 10:39 PM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:

...

OK. Let's say then that it would be nice to have a graph containing all non-category pages and all categories. Probably the links from/to the categories should be just the category links (even if a category with a page might contain other linked content).

...

The best thing, from a computational perspective, is that if n is the number of pages plus the number of category pages every page or category page is assigned a node number in the interval [0..n).

Surely you can build a hash map on whatever unique identifier you like and get constant (amortized) lookup speed.

FYI, that was before cuckoo hashing. We now have dictionaries with actual constant lookup speed (not amortized).

...

I think it is best if we provide you with a raw format that will work and you do your own post processing to obtain the "id space" that you like.

It depends on what you mean by "you". :) My idea is that it would have been nice to have a "gold standard" Wikipedia graph so that there is no postprocessing done by the end user. This is in the interest of reproducibility--it's very easy with large datasets to miss some trivial detail and then things go amok. I would be happy to help make this process efficient and convenient so that it can be performed at each dump. That's also the purpose of deciding on a convenient ID space and then giving away a bidirectional link map. The ids inside the SQL tables are artifacts of Wikipedia's construction phases--we should hide them. Ciao, seba

Aaron Halfaker

2:17 p.m.

...

This is in the interest of reproducibility--it's very easy with large dataset to miss some trivial detail and then things go amok

Yes. It would be nice if we didn't have to manage such details.

...

The ids inside the SQL tables are artifacts of Wikipedia's construction phases--we should hide them.

I don't think that hiding IDs that could be used to, say, look up content or compare against user contrib histories is a good idea. -Aaron

Sebastiano Vigna

2:33 p.m.

On 11 Dec 2013, at 8:17 AM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:

...

This is in the interest of reproducibility--it's very easy with large dataset to miss some trivial detail and then things go amok

Yes. It would be nice if we didn't have to manage such details.

I'm sorry--I'm a bit confused. I understand you are from Wikipedia--the intended meaning of the above phrase is that you guys at Wikipedia are not interested in generating and distributing such a graph?

...

The ids inside the SQL tables are artifacts of Wikipedia's construction phases--we should hide them.

I don't think that hiding IDs that could be used to, say, look up content or compare against user contrib histories is a good idea.

Maybe "hiding" is not the correct word. I would also distribute an array of SQL ids indexed by node numbers to access the SQL data if necessary--the point is that when you work with the graph a contiguous ID space is essential. Otherwise, for example, all vector norms in the computation of spectral rankings are altered by the existence of numerous isolated nodes. Ciao, seba
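A small Python sketch may make the point about isolated nodes concrete. It assumes vectors are indexed directly by SQL id when the id space is not compacted, so every unused id behaves like an isolated node; the function names and the sample ids are hypothetical.

```python
def compact_ids(sql_ids):
    """Build the array of SQL ids indexed by node number, plus its inverse."""
    sql_id_of_node = sorted(set(sql_ids))
    node_of_sql_id = {sql_id: node for node, sql_id in enumerate(sql_id_of_node)}
    return sql_id_of_node, node_of_sql_id

def uniform_mass_on_real_nodes(sql_ids):
    """Fraction of a uniform vector that lands on real pages when the
    vector is indexed directly by SQL id (gaps act as isolated nodes)."""
    real = set(sql_ids)
    size = max(real) + 1  # vector length needed to index by raw SQL id
    return len(real) / size

# Hypothetical, deliberately sparse SQL ids.
ids = [10, 3, 1000]
sql_id_of_node, node_of_sql_id = compact_ids(ids)
share = uniform_mass_on_real_nodes(ids)
```

With these three ids, a uniform vector over the raw id space puts only 3/1001 of its mass on real pages, whereas over the compact space [0..3) all of the mass is on real pages. The `sql_id_of_node` array keeps the way back to the SQL data.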

Johannes Kroll

2:44 p.m.

Hi all, you may be interested in https://wikitech.wikimedia.org/wiki/Nova_Resource:Catgraph Cheers Johannes

Sebastiano Vigna

2:58 p.m.

On 11 Dec 2013, at 8:44 AM, Johannes Kroll <johannes.kroll(a)wikimedia.de> wrote:

...

Hi all, you may be interested in https://wikitech.wikimedia.org/wiki/Nova_Resource:Catgraph

That is close to what I had in mind for the category part, although it appears to be accessible as a server, whereas a WebGraph instance is accessed as an embedded library, which is significantly faster. The point of having the whole graph is that you can also use the other links to make inferences to validate/patch the category hierarchy (e.g., people pages should have a higher percentage of links to/from people pages, so you could use the category as a base vector and run some iterative process to catch missing items). Ciao, seba
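The message does not say which iterative process is meant. One plausible reading is a personalized PageRank whose preference vector is concentrated on the pages already known to be in the category; the pure-Python sketch below illustrates that reading only. The graph representation (a list of successor lists over nodes 0..n-1), the parameter values and the function name are assumptions.

```python
def personalized_pagerank(successors, seeds, alpha=0.85, iterations=50):
    """Power iteration for PageRank with a preference vector that is
    uniform on `seeds` and zero elsewhere."""
    n = len(successors)
    seeds = set(seeds)
    preference = [1.0 / len(seeds) if x in seeds else 0.0 for x in range(n)]
    rank = preference[:]
    for _ in range(iterations):
        # Teleportation part: (1 - alpha) of the mass returns to the seeds.
        following = [(1.0 - alpha) * p for p in preference]
        for x, out in enumerate(successors):
            if out:
                share = alpha * rank[x] / len(out)
                for y in out:
                    following[y] += share
            else:
                # Dangling node: send its mass back along the preference.
                for y in range(n):
                    following[y] += alpha * rank[x] * preference[y]
        rank = following
    return rank

# Hypothetical toy graph: nodes 0 and 1 link to each other, as do 2 and 3.
graph = [[1], [0], [3], [2], []]
scores = personalized_pagerank(graph, seeds=[0])
```

Pages that score high but are missing from the category would then be candidates for manual review. On a graph of Wikipedia's size one would run this on a compact node numbering with arrays rather than Python lists.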

Diego Ceccarelli

3:28 p.m.

Hi all, I agree with Sebastiano, it would be really useful to have a 'plain' version of the graph. It would also be nice to have a plain version of Wikipedia, with all the articles organized in fields (title, categories, links, images, etc.). I recently wrote code [1] to convert the dump into JSON, so that each article can be easily loaded into an object without handling the parsing of the fields. I think that having this kind of dataset provided directly by Wikipedia would really improve the quality of research and applications. Of course, I would be happy to help if needed ;) Cheers, Diego

[1] https://github.com/diegoceccarelli/json-wikipedia

-- Computers are useless. They can only give you answers. (Pablo Picasso) _______________ Diego Ceccarelli High Performance Computing Laboratory Information Science and Technologies Institute (ISTI) Italian National Research Council (CNR) Via Moruzzi, 1 56124 - Pisa - Italy Phone: +39 050 315 2984 Fax: +39 050 315 2040 ________________________________________

Maarten Dammers

15 Dec

11:30 a.m.

Hi everyone, I've been playing around with the structure of category graph for quite some time for https://commons.wikimedia.org/wiki/User:CategorizationBot . This bot takes an uncategorized image, looks where it is used and tries to find relevant categories at Commons. It applies some filters and one of them is the filter against over categorization (https://commons.wikimedia.org/wiki/Commons:OVERCAT). For this I create a simple child->parent table that used to live on the Toolserver, but now appears to have vanished.

I would love to have some sort of dump or (even better) a central service I can query. It should contain for all Wikimedia projects:
* Page links (page A links to page B)
* Category links (page A is in category C)
* Image links (page A uses image I)
* Interlanguage links (page A in language en links to page A' in language nl)
* Interproject links (page A in the English Wikipedia links to page A' on Wikimedia Commons)

And to make it really complete:
* Wikidata claims (item A has a claim pointing to item B)

Maarten

Johannes Kroll

3:27 p.m.

Hi Maarten, On Sun, 15 Dec 2013 12:30:24 +0100 Maarten Dammers <maarten(a)mdammers.nl> wrote:

...

this should be in the pagelinks table in the database replica.

...

* Category links (page A is in category C)

If you need only direct relations, this is in the categorylinks table. For recursive/transitive relations (page A is in a category that is in category C... and so on), that's what we currently have in Catgraph for a selection of wikis. http://sylvester.wmflabs.org:8090/list-graphs You can query Catgraph from your bot on Labs, or even from somewhere else (but you need access to the db replica as well - Catgraph only stores page_ids). We currently have the category links, but we could include other data.

...

* Image links (page A uses image I)
* Interlanguage links (page A in language en links to page A' in language nl)
* Interproject links (page A in the English Wikipedia links to page A' on Wikimedia Commons)

These should be in the database replica, right? imagelinks, languagelinks, iwlinks.

...

And to make it really complete: * Wikidata claims (item A has a claim pointing to item B)

I'm currently working on importing transitive properties, like "is in the administrative unit", to Catgraph. This depends on the database dumps in the new plain json format, which should become available Any Day Now (TM).

Maarten Dammers

3:53 p.m.

Hi Johannes, Johannes Kroll wrote on 15-12-2013 16:27:

...

I would love to have some sort of dump or (even better) a central service I can query. It should contain for all Wikimedia projects: * Page links (page A links to page B)

this should be in the pagelinks table in the database replica.

Maybe I wasn't clear. I know how MediaWiki works and what tables to query [1], but it isn't designed for recursion or crawling it as a directed graph. That really kills performance and doesn't scale at all. You need a custom setup for that. Maarten [1] https://commons.wikimedia.org/wiki/File:MediaWiki_database_schema_latest.svg

Johannes Kroll

4:20 p.m.

On Sun, 15 Dec 2013 16:53:18 +0100 Maarten Dammers <maarten(a)mdammers.nl> wrote:

...


Yes, and that is what Catgraph is about. It is a directed graph database made for exactly this kind of thing. We currently carry the category links, but we could import other graphs as well, such as pagelinks. I just wasn't sure whether you needed to do recursive queries or not. If you do, Catgraph is the thing. https://wikitech.wikimedia.org/wiki/Nova_Resource:Catgraph

Sebastiano Vigna

4:51 p.m.

On 15 Dec 2013, at 3:30 AM, Maarten Dammers <maarten(a)mdammers.nl> wrote:

...

I would love to have some sort of dump or (even better) a central service I can query. It should contain for all Wikimedia projects:
* Page links (page A links to page B)
* Category links (page A is in category C)
* Image links (page A uses image I)
* Interlanguage links (page A in language en links to page A' in language nl)
* Interproject links (page A in the English Wikipedia links to page A' on Wikimedia Commons)

And to make it really complete:
* Wikidata claims (item A has a claim pointing to item B)

Well, somehow that would be a more interesting graph to build--all pages, all languages, all images, all categories, all together. Probably in the 100M pages/5B arcs range. Having all languages together would help to make inference/learning easier by working on the English part and then propagating the results. At that point, actually, compression would be essential in making it an in-core data structure. Ciao, seba

Johannes Kroll

17 Dec

2:01 p.m.

On Sun, 15 Dec 2013 08:51:33 -0800 Sebastiano Vigna <vigna(a)di.unimi.it> wrote:

...


Not sure how practical it would be to put this all into one graph. In CatGraph we have one instance for the category/page structure of each supported language. These live in separate processes each. They talk to a server process which talks to clients. The largest single graph has about 76 million arcs (enwiki category links). Currently there is one server process on one host, but we could distribute that to several hosts when we need to, e.g. when we need more RAM. This approach scales better than one single in-memory graph for *everything*.

Sebastiano Vigna

3:57 p.m.

On 17 Dec 2013, at 6:01 AM, Johannes Kroll <johannes.kroll(a)wikimedia.de> wrote:

...

Well, we routinely handle in RAM on a laptop graphs two orders of magnitude larger. The point is what your graph representation is--we developed compressed representations that reduce the size of the graph in memory by an order of magnitude. If you have a look at http://law.di.unimi.it/webdata/enwiki-2013/, you'll see that all of English Wikipedia (no templates) is 159 MB. With all categories and category links it is 230 MB. Lists of titles in lexicographical order compress very well using prefix omission. Moreover, our proposal is to distribute an embedded, easy-to-use graph. Load it in memory and access successor lists, and that's it. Setting up a service is a more complex goal, it is more complex to use and it gives slower access. It's like an SQL server vs. an embedded BerkeleyDB database. Ciao, seba
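Prefix omission (also known as front coding) is easy to illustrate. The Python sketch below is a minimal version, not the encoding used in the datasets mentioned above: each title in a sorted list is stored as the length of the prefix it shares with the previous title, followed by the remaining suffix. The function names and sample titles are hypothetical.

```python
def front_encode(sorted_titles):
    """Encode a lexicographically sorted list as (shared_prefix_length, suffix)."""
    encoded = []
    previous = ""
    for title in sorted_titles:
        shared = 0
        limit = min(len(previous), len(title))
        while shared < limit and previous[shared] == title[shared]:
            shared += 1
        encoded.append((shared, title[shared:]))
        previous = title
    return encoded

def front_decode(encoded):
    """Rebuild the original list from (shared_prefix_length, suffix) pairs."""
    titles = []
    previous = ""
    for shared, suffix in encoded:
        previous = previous[:shared] + suffix
        titles.append(previous)
    return titles

# Hypothetical sample titles, already in lexicographical order.
sample = ["Albert Einstein", "Albert Speer", "Alberta"]
coded = front_encode(sample)
```

Neighbouring titles in sorted order tend to share long prefixes, which is why the suffixes are short. Practical implementations usually restart the encoding every few dozen entries so that a title can be decoded without scanning from the beginning.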

Diego Ceccarelli

4:09 p.m.

On Tue, Dec 17, 2013 at 4:57 PM, Sebastiano Vigna <vigna(a)di.unimi.it> wrote:

...


In my opinion, the first step would be to have the graphs (links, categories) in a simple plain format: e.g., for each entity, represented by its wikiId, the outgoing links, or the categories (still represented by wikiIds) separated by tabs and provided by Wikipedia... and then a simple easy-to-use framework for indexing and navigating the graph. Cheers, Diego


Johannes Kroll

5:02 p.m.

On Tue, 17 Dec 2013 07:57:59 -0800 Sebastiano Vigna <vigna(a)di.unimi.it> wrote:

...


We store page_ids only, or any other integer IDs. Tools using it fetch all other data from SQL. This makes sense for Tools on Labs for example, which have access to the DB replica anyway. We don't compress anything, which makes it quite fast.

...

Moreover, our proposal is to distribute an embedded, easy-to-use graph. Load it in memory and access successors lists, and that's it. Setting up a service is a more complex goal, it is more complex to use and it gives slower access.

It isn't a goal, the service already exists. The data you get is fresh, automatically updated every hour or so, unlike a graph that you would download. It's as easy to use as any other software library that you pull into your script with "import foo". As to speed, most results are pretty much instant. Try it: https://wikitech.wikimedia.org/wiki/Nova_Resource:Catgraph

Sebastiano Vigna

5:09 p.m.

On 17 Dec 2013, at 9:02 AM, Johannes Kroll <johannes.kroll(a)wikimedia.de> wrote:

...

Compression and speed are not at odds with each other--quite the contrary. A standard compression format from WebGraph delivers an edge in ~50ns. Frankly, any service will require orders of magnitude more. Do you have any timings to compare?

...

It isn't a goal, the service already exists. The data you get is fresh, automatically updated every hour or so, unlike a graph that you would

This is exactly what you don't want for research purposes: moving targets. You need a dataset, downloaded at some point in time (like Wikipedia dumps), that other people can use to replicate our results or improve them. Anything that is updated every hour is unusable for that purpose. It's just a different goal. Once you nail down your algorithms it might be, of course, a good idea to run them on fresh data, but research requires replicability.

...

download. It's as easy to use as any other software library that you pull into your script with "import foo". As to speed, most results are pretty much instant. Try it:

"Instant" has for me no meaning. Can you quantify? Ciao, seba

Johannes Kroll

19 Dec

12:28 p.m.

On Tue, 17 Dec 2013 09:09:08 -0800 Sebastiano Vigna <vigna(a)di.unimi.it> wrote:

...


Yes. In the link I posted in the mail you quoted, there's an example query including a set operation. The timing includes setting up the connection, doing the two queries and the set operation, converting the result to the line-based format, and transferring that over HTTP. This is a real-world query, and about the same as you would get in a tool that runs on Labs which uses CatGraph (minus the overhead from starting the Curl binary, setting up the connection, and the slight overhead from HTTP, because you would use plain TCP transfers in such a tool). You can log in to Tool Labs and try various queries yourself. "Deliver an edge in 50ns" sounds impressive, but this value doesn't mean much without context. What does it mean?

...

It isn't a goal, the service already exists. The data you get is fresh, automatically updated every hour or so, unlike a graph that you would

If you need information about the current state of Wikipedia, anything that doesn't reflect the current state of Wikipedia is simply not useful. That's the case for most maintenance tools for example. So yes, it is a different use case.

-- Johannes Kroll, Software developer, Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin Tel. (030) 219 158 26-0 http://wikimedia.de Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V. Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 B. Recognized as a charitable organization by the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.

Sebastiano Vigna

4:03 p.m.

On 19 Dec 2013, at 4:28 AM, Johannes Kroll <johannes.kroll(a)wikimedia.de> wrote:

...

"Real-world" means little: the world is large--people have different needs. Having the graph embedded, with high access speed, is different from depending on a service with a fixed number of primitives. What if I want to know a centrality measure? Will you implement it for me? If I have to fetch successor lists and compute it by myself it will be 100-1000x slower. If I ask for a successor list, how much time per arc, overall, will it take? This is the standard measure for the speed of a graph representation. I can't infer anything from the example you quote.

...

"Deliver an edge in 50ns" sounds impressive, but this value doesn't mean much without context. What does it mean?

Most basic graph traversal algorithms are linear in the number of arcs traversed. Thus, a standard and informative measure of the speed of access to a graph (see any paper on the subject) is how much time it takes to get a successor (say, from an iterator providing the successors of a node). You don't need any context. Note that it's you claiming that CatGraph is the service I need. I simply think it is a service with a different goal. I'm sure people love it. Ciao, seba
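As an illustration of the "time per arc" measure described above, here is a minimal Python sketch. It assumes a small in-memory adjacency-list graph; the dictionary `toy` and the function names are hypothetical and belong neither to WebGraph nor to CatGraph. Absolute numbers from an interpreted language are not comparable with the figures quoted in this thread; the sketch only shows what is being counted and divided.

```python
import time
from collections import deque


def bfs_arc_count(successors, start):
    """Breadth-first visit from `start`; returns (nodes visited, arcs traversed)."""
    seen = {start}
    queue = deque([start])
    arcs = 0
    while queue:
        node = queue.popleft()
        for succ in successors[node]:  # one step of the successor iterator
            arcs += 1
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return len(seen), arcs


def ns_per_arc(successors, start):
    """Wall-clock nanoseconds per arc for a single visit (rough, single run)."""
    t0 = time.perf_counter_ns()
    _, arcs = bfs_arc_count(successors, start)
    elapsed = time.perf_counter_ns() - t0
    return elapsed / arcs if arcs else float("nan")


# Toy graph (made-up data): node -> list of successors.
toy = {0: [1, 2], 1: [2, 3], 2: [3], 3: [0]}

nodes, arcs = bfs_arc_count(toy, 0)
print(f"{nodes} nodes, {arcs} arcs, {ns_per_arc(toy, 0):.0f} ns/arc")
```

Because the visit touches every arc once, dividing the elapsed time by the arc count gives a figure that does not depend on the particular algorithm layered on top.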

Johannes Kroll

5:10 p.m.

On Thu, 19 Dec 2013 08:03:20 -0800 Sebastiano Vigna <vigna(a)di.unimi.it> wrote:

...

On 19 Dec 2013, at 4:28 AM, Johannes Kroll <johannes.kroll(a)wikimedia.de> wrote:

If you want to do centrality measures on *current* data, you will either have to implement it yourself (CatGraph is free software after all), or wait for me to implement it.

...

If I have to fetch successor lists and compute it by myself it will be 100-1000x slower. If I ask for a successor list, how much time per arc, overall, will it take? This is the standard measure for the speed of a graph representation. I can't infer anything from the example you quote.

I think you can. You can even run any query yourself. Try something like:

    curl http://sylvester.wmflabs.org:8090/dewiki/traverse-successors+235276+9999 | head

You'll get something like:

    OK. 102243 nodes, 0.160605s: 235276 338 464 1704 [...]

Now you can calculate your time per arc. This, again, includes setting up the connection and converting the numbers to a line-based format for transfer. So it isn't directly comparable to your numbers. And what for, anyway? It doesn't make *any* difference to a user of a web tool whether they wait for 0.01 seconds, or 0.09 seconds longer. I can hardly justify spending any more time on optimizing this, since there is no point. You didn't mention how I could reproduce your 50ns number.

...

"Deliver an edge in 50ns" sounds impressive, but this value doesn't mean much without context. What does it mean?

No, not at all. I saw this thread and said you might be interested, and I wasn't addressing *you* or any single person in particular. I didn't want to start a pissing contest either. We are providing a service that enables tool authors and others to traverse graphs representing *current* data from Wikipedia quickly and easily. We need an implementation that scales well and can have its data constantly updated, in a central place, without using too much memory while it runs. You, on the other hand, want to do research on data that can be months old. This is a different use case. But we already said that. Both. Twice.

Sebastiano Vigna

5:16 p.m.

On 19 Dec 2013, at 9:10 AM, Johannes Kroll <johannes.kroll(a)wikimedia.de> wrote:

...

1570ns/arc. Do you think we computed the degrees of separation of Facebook using a service like that? Of a 700M nodes/69B edges graph? You have no idea of what's "scaling". Read the "Four degrees of separation" paper. Ciao, seba
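The 1570ns figure can be reproduced from the status line quoted in the previous message ("OK. 102243 nodes, 0.160605s"). The sketch below is only an illustration: the function name `ns_per_node` is hypothetical, and the parsing assumes the status line always has the shape shown in that one example. The line reports nodes returned, which the reply above treats as arcs traversed.

```python
import re


def ns_per_node(status_line):
    """Nanoseconds per reported node, from a line like 'OK. N nodes, T s:'."""
    match = re.match(r"OK\. (\d+) nodes, ([0-9.]+)s", status_line)
    if match is None:
        raise ValueError(f"unrecognized status line: {status_line!r}")
    nodes = int(match.group(1))
    seconds = float(match.group(2))
    return seconds / nodes * 1e9  # seconds -> nanoseconds


# Status line copied from the example output earlier in the thread.
line = "OK. 102243 nodes, 0.160605s:"
print(f"{ns_per_node(line):.1f} ns per node")
```

As noted in that message, the measured time includes connection setup and conversion to a line-based format, so the result is an end-to-end figure rather than a raw traversal cost.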

Johannes Kroll

5:31 p.m.

On Thu, 19 Dec 2013 09:16:46 -0800 Sebastiano Vigna <vigna(a)di.unimi.it> wrote:

...

On 19 Dec 2013, at 9:10 AM, Johannes Kroll <johannes.kroll(a)wikimedia.de> wrote:

1570ns/arc.

As opposed to your number which I still can't reproduce...

...

Do you think we computed the degrees of separation of Facebook using a service like that? Of a 700M nodes/69B edges graph?

No, I don't. And why on earth would you want to use CatGraph to do that? Did you even read my last message before replying to it? CatGraph is a tool with specific purposes. Computing degrees of separation of Facebook is definitely not one of them. Now, please, don't reply unless you actually read what I wrote.

Maarten Dammers

22 Dec

12:42 p.m.

New subject: Catgraph for Commons categories

Hi Johannes, Forking discussion for a specific subset. Johannes Kroll wrote on 19-12-2013 18:10:

...

On Commons Daniel Schwen started two discussions related to categories:
* https://commons.wikimedia.org/wiki/Commons:Village_pump#Category_.22tree.22…
* https://commons.wikimedia.org/wiki/Commons:Village_pump#Commons_Category_In…

I used to have a simple child->parent database at the Toolserver, it broke down and if Catgraph (https://wikitech.wikimedia.org/wiki/Nova_Resource:Catgraph) is a good alternative, I'd rather use that.

Use cases (all at Wikimedia Commons):
* Make a report of loops. I want a report for each length: 0 being self categorized, 1 being A->B->A, 2 being A->B->C->A etc.
* Give a number of categories and filter out overcategorization (Category:Berlin + Category:Germany -> Category:Berlin)

Can this be done with catgraph? Maarten
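The two use cases above can be sketched on a plain child->parent mapping. This is a minimal illustration in Python, not how Catgraph works: the function names and the toy `parents` data are hypothetical, the cycle search is a bounded brute-force enumeration suitable only for small graphs, and the overcategorization filter assumes the relevant part of the hierarchy is acyclic.

```python
def cycles_by_length(parents, max_length):
    """Group simple cycles by the length convention above: 0 = self loop,
    1 = A->B->A, 2 = A->B->C->A. Each cycle is listed once, starting at
    its smallest node."""
    report = {}

    def extend(start, path, on_path):
        for nxt in sorted(parents.get(path[-1], ())):
            if nxt == start:
                report.setdefault(len(path) - 1, []).append(tuple(path))
            elif nxt > start and nxt not in on_path and len(path) <= max_length:
                on_path.add(nxt)
                path.append(nxt)
                extend(start, path, on_path)
                path.pop()
                on_path.remove(nxt)

    for start in sorted(parents):
        extend(start, [start], {start})
    return report


def ancestors(parents, cat):
    """All categories reachable from `cat` by following child->parent arcs."""
    seen, stack = set(), [cat]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def drop_overcategorization(parents, cats):
    """Drop every category that is an ancestor of another one in `cats`.
    Assumes no cycle connects two of the given categories."""
    redundant = set()
    for cat in cats:
        redundant |= ancestors(parents, cat) - {cat}
    return set(cats) - redundant


# Toy child -> parents data (made up for illustration).
parents = {
    "Berlin": {"Cities in Germany"},
    "Cities in Germany": {"Germany"},
    "Germany": {"Europe"},
    "A": {"B"},
    "B": {"A"},
    "Self": {"Self"},
}

print(cycles_by_length(parents, 5))
print(drop_overcategorization(parents, {"Berlin", "Germany"}))
```

On a graph the size of Commons an exhaustive enumeration like this would not be practical; it is meant only to pin down what the two reports would contain.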

Johannes Kroll

7 Jan

3:12 p.m.

New subject: Catgraph for Commons categories

Hi Maarten, first, sorry for the late reply, I was on holiday. On Sun, 22 Dec 2013 13:42:25 +0100 Maarten Dammers <maarten(a)mdammers.nl> wrote:

...

[...] On Commons Daniel Schwen started two discussions related to categories: * https://commons.wikimedia.org/wiki/Commons:Village_pump#Category_.22tree.22… * https://commons.wikimedia.org/wiki/Commons:Village_pump#Commons_Category_In… I used to have a simple child->parent database at the Toolserver, it broke down and if Catgraph (https://wikitech.wikimedia.org/wiki/Nova_Resource:Catgraph) is a good alternative, I'd rather use that. Use cases (all at Wikimedia Commons): * Make a report of loops. I want a report for each length: 0 being self categorized, 1 being A->B->A, 2 being A->B->C->A etc. * Give a number of categories and filter out overcategorization (Category:Berlin + Category:Germany -> Category:Berlin) Can this be done with catgraph?

I think so. This is a little tool I wrote to find cycles in WP categories: http://tools.wmflabs.org/render-tests/catcycle-dev/catcycle.py But we don't have Commons categories yet. I'll look into importing commons, then get back to you. Cheers, Johannes

Johannes Kroll

7:30 p.m.

New subject: Catgraph for Commons categories

Hi, On Sun, 22 Dec 2013 13:42:25 +0100 Maarten Dammers <maarten(a)mdammers.nl> wrote:

...

[...] Use cases (all at Wikimedia Commons): * Make a report of loops. I want a report for each length: 0 being self categorized, 1 being A->B->A, 2 being A->B->C->A etc. * Give a number of categories and filter out overcategorization (Category:Berlin + Category:Germany -> Category:Berlin) Can this be done with catgraph?

Some progress on this: I imported Commons categories. To find cycles, try this: http://tools.wmflabs.org/render-tests/catcycle-dev/catcycle.py?action=find-…

This uses CommonsRoot, which is supposed to be "the top level node in Commons tree data structure from which every other category is accessible", as a start node, and looks for cycles with (practically) unlimited depth. From the number of nodes visited, it looks like about 2500 categories are not reachable from CommonsRoot (total Commons categories: 3163665, nodes visited: 3166147). So the query should find most of the cycles, but possibly not every last one of them. It's a good start to look for cycles in any case. The page does take about half a minute or so to load.

One way to find some unreachable categories is the "Find root nodes" function in CatCycle. Root nodes are categories without any parent category. As I understand it, there shouldn't be any root nodes except CommonsRoot by definition.

I've set the refresh interval of the Commons graph to 4 hours, so this is the maximum age of the data you get (plus replication lag of the Labs replica). I can shorten that if it's needed. Does that help so far?

For intersections of categories of actual files, it's different. Currently we have one VM where we keep all graphs we imported. Commons is rather large, and the whole graph including leaves (=files) won't fit into the RAM of this host. This means I will have to distribute the graphs into several VMs first. This was planned anyway, I just haven't done it yet.

Cheers, Johannes
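To illustrate the distinction drawn above between root nodes and categories unreachable from CommonsRoot, here is a small Python sketch. It is not CatCycle's implementation: the function names and the toy data are hypothetical, and the graph is assumed to be given as a child->parents mapping plus the full set of category names.

```python
def root_nodes(parents, categories):
    """Categories without any parent category."""
    return {cat for cat in categories if not parents.get(cat)}


def reachable_from(parents, root):
    """Categories reachable from `root` by walking parent -> child arcs."""
    children = {}
    for child, cat_parents in parents.items():
        for parent in cat_parents:
            children.setdefault(parent, set()).add(child)
    seen, stack = {root}, [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Toy data (made up): one proper tree, one orphan, one detached loop.
categories = {"CommonsRoot", "Topics", "Berlin", "Orphan", "LoopA", "LoopB"}
parents = {
    "Topics": {"CommonsRoot"},
    "Berlin": {"Topics"},
    "LoopA": {"LoopB"},
    "LoopB": {"LoopA"},
}

roots = root_nodes(parents, categories)
unreachable = categories - reachable_from(parents, "CommonsRoot")
print(sorted(roots))
print(sorted(unreachable))
```

In the toy data the detached loop is unreachable from CommonsRoot even though neither of its members is a root node, which is one reason a root-node search finds only some of the unreachable categories.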

Johannes Kroll

11 Jan

1:45 a.m.

New subject: Catgraph for Commons categories

Another update on this: today I imported the File and Category namespaces of commonswiki into another Catgraph instance. This is necessary for doing category intersections that include images, and finding overcategorization. The new graph runs on a separate VM. It's not directly accessible because the frontend tools don't know of the new host yet. More on this later. On Tue, 7 Jan 2014 20:30:42 +0100 Johannes Kroll <johannes.kroll(a)wikimedia.de> wrote:

...

Hi, On Sun, 22 Dec 2013 13:42:25 +0100 Maarten Dammers <maarten(a)mdammers.nl> wrote:


Daniel Mietchen

17 Dec

6:32 p.m.

My bot at http://commons.wikimedia.org/wiki/User:Open_Access_Media_Importer_Bot has been struggling with both under- and overcategorization, as it attempts to categorize its uploads based on the metadata provided by the journals. There is little we can do against undercategorization at upload, but I would like to use CategorizationBot's approach of inferring categories from pages using the file, and if there were a usable filter against overcategorization that took into account existing Commons categories, I would certainly give it a try. Another challenge with the bot was that thousands of the categories it had proposed did not exist on Commons yet, and separating the ones that were compatible with Commons category structures from those that were not was not trivial. In the end, I created (and categorized) most of the necessary categories manually. You can see some remnants of that in the subcategories of https://commons.wikimedia.org/wiki/Category:Uploaded_with_Open_Access_Media… . Any thoughts about that would be appreciated. Daniel -- http://www.naturkundemuseum-berlin.de/en/institution/mitarbeiter/mietchen-d… https://en.wikipedia.org/wiki/User:Daniel_Mietchen/Publications http://okfn.org http://wikimedia.org On Sun, Dec 15, 2013 at 12:30 PM, Maarten Dammers <maarten(a)mdammers.nl> wrote:

...

analytics@lists.wikimedia.org

participants (8)

Aaron Halfaker
Dan Andreescu
Daniel Mietchen
Dario Taraborelli
Diego Ceccarelli
Johannes Kroll
Maarten Dammers
Sebastiano Vigna