Hi,
I've written a visual tool for analyzing the graph of interlanguage links between all 256 editions of Wikipedia.
Its main advantages, compared to bots, are:
- it analyzes the whole inconsistent component at once, while bots tend to work "locally" (in some neighborhood of an article);
- cool (IMHO) graph visualization;
- concrete recommendations: remove a link, split an article, merge articles, remove redirects.
To stress the advantage of "global" vs. "local" analysis of a component: the largest connected component in the graph contains over 48'000 articles, mixing over 2'500 different subjects. Some of the sources of semantic drift in such components are not visible "locally".
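The "global" view amounts to taking whole connected components of the interlanguage-link graph at once. A minimal sketch in Python, with made-up sample links (the tool itself is a Java application, so this is only an illustration of the idea):

```python
from collections import defaultdict, deque

def connected_components(links):
    """Group articles into connected components of the
    (undirected) interlanguage-link graph."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:                      # plain breadth-first search
            node = queue.popleft()
            comp.add(node)
            for nbr in graph[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(comp)
    return components

# Two separate subjects accidentally joined into one component
# by a single "drift" link (sample data, not real dump content):
links = [
    ("en:French Guiana", "fr:Guyane"),
    ("fr:Guyane", "en:Guyana"),          # the drift link
    ("de:Bulldogge", "en:Bulldog"),
]
comps = connected_components(links)
print(sorted(len(c) for c in comps))     # [2, 3]
```

A bot looking only at the neighborhood of one article never sees that a single stray link has glued thousands of unrelated subjects into one component; traversing the whole component makes it obvious.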
Main disadvantages:
- works on preprocessed dumps instead of the "live" Wikipedia, so the recommendations may be outdated;
- (for the moment) does not recognize some of the redirects, due to the poor quality of the redirect dumps. Apparently I'm not the only one affected by the problem, and the folks at wikitech-l are aware of the issue;
- requires Java 6 and eats a lot of resources (512M of memory seems to be enough even for the largest case);
- doesn't change anything itself; it points to the possible sources of problems instead.
The tool is far from complete; "prototype" would be a more appropriate name (its original purpose was to help me evaluate some ideas for my PhD). Please try it and send me your feedback; I'd like to make it more useful for the community.
You can find the tool here: http://wikitools.icm.edu.pl/
Regards, Bolo1729
Very cool, nice work, especially the interwiki integration. I've messed around a bit with viewing MediaWiki link graphs using Cytoscape, software designed for visualizing molecular interactions. But only a _very_ modern machine (I'm talking about a maxed-out Mac Pro here :) could handle the English Wikipedia, and certainly not the entire interwiki link graph as you have done. Still, you can do lots of different kinds of visualizations with it. Here's a good description:
http://www.mkbergman.com/?p=415
I also attached a Cytoscape file that will let you visualize Scholarpedia. In case the list scrubs it: http://filebin.ca/zmtoy/scholarpedia.cys
(Not meaning to hijack your thread; I just thought you would find it interesting :)
Cheers, Brian
On Mon, Mar 17, 2008 at 4:46 PM, Lukasz Bolikowski bolo@icm.edu.pl wrote:
> I've written a visual tool for analyzing the graph of interlanguage links between all 256 editions of Wikipedia. [...]
WikiEN-l mailing list WikiEN-l@lists.wikimedia.org To unsubscribe from this mailing list, visit: https://lists.wikimedia.org/mailman/listinfo/wikien-l
On 3/18/08, Lukasz Bolikowski bolo@icm.edu.pl wrote:
I've written a visual tool for analyzing the graph of interlanguage links between all 256 editions of Wikipedia.
Sounds good, but with the requirement for local dumps, it would be extremely difficult to get up and running for me, and probably most people. Have you considered going the Magnus Manske direction and writing a web based tool instead? It would be a shame to have such a powerful and interesting tool be used by so few people.
Steve
On 3/19/08, Steve Bennett stevagewp@gmail.com wrote:
> Sounds good, but with the requirement for local dumps, it would be extremely difficult to get up and running for me, and probably most people. [...]
Oh, actually it wasn't very clear in the original post, but you just download the application file and the rest happens automatically. And of course (stupid me) you don't need the 3 GB full article text; it's only 12 MB or so.
So much easier to download and play with than I thought, sorry.
Steve
It's pretty cool, but the text sizing seems odd: all of the text labels are so small that to read them I need to zoom in so far that I can't really see the graph anymore. Everything seems to be on some sort of astronomical scale.
For example, my first test is "bulldog". The entire "bulldog" cluster of links is shown as a bundle that overlaps the "ull" of the word. At the other end of the scale, to fit the entire graph of bundles on the screen (from Bulldog (disambiguation) to tr:Buldok (anlam ayrim)), I need to zoom out so far that no text is readable at all. It doesn't seem helpful to place these nodes so far apart, and to create orders of magnitude of difference in scaling between article labels and these cluster labels.
Also, it would be nice if the documentation included suggestions for how to use it. We are not all seasoned interwiki surfers :) Some examples of "look for X, then do Y" would be great. And of course, being able to click on an article and have it retrieve the article text for comparison would be very useful.
Steve
Sorry to keep spamming, but can you explain what's going on in the "French Guyana" graph? The recommendations seem bad: it suggests removing what look like correct links. e.g.:
- [[:gl:Güiana francesa]] ---> [[:no:Fransk Guyana]]
- [[:gl:Güiana francesa]] ---> [[:en:French Guiana]]
- [[:gl:Güiana francesa]] ---> [[:simple:French Guiana]]
- [[:gl:Güiana francesa]] ---> [[:ga:Guáin na Fraince]]
Also, something is clearly wrong with the Arabic article, but from the graph, it's not clear what. I think perhaps the article "جويانا" links to [[Guyana]] but that in turn links to "غويانا". You have this strong sense that the graph is trying to tell you *something* (notice the huge pretty pattern around that article), but it's not obvious what it is.
Also, for performance, it would be useful to be able to hide all grey (normal) links, and just show the ones that need attention.
Steve
Steve Bennett wrote:
> Sorry to keep spamming, but can you explain what's going on in the "French Guyana" graph? The recommendations seem bad: it suggests removing what look like correct links. e.g.:
> - [[:gl:Güiana francesa]] ---> [[:no:Fransk Guyana]]
> - [[:gl:Güiana francesa]] ---> [[:en:French Guiana]]
> - [[:gl:Güiana francesa]] ---> [[:simple:French Guiana]]
> - [[:gl:Güiana francesa]] ---> [[:ga:Guáin na Fraince]]
I don't know about others, but for me this is not spam but very valuable feedback. Keep spamming ;) Now to the point.
It is not a wrong recommendation, rather a poor way of communicating an observation. These recommendations are due to the fact that there's already [[:gl:Güiana Francesa - Guyane française]] connected to the "French Guiana" group, and that article has stronger connections (in both directions). This is actually a "merge" pattern. More on this in the next paragraph.
> Also, something is clearly wrong with the Arabic article, but from the graph, it's not clear what. I think perhaps the article "جويانا" links to [[Guyana]] but that in turn links to "غويانا". You have this strong sense that the graph is trying to tell you *something* (notice the huge pretty pattern around that article), but it's not obvious what it is.
The Arabic article close to "Guyana" is outside the group because there's another Arabic article inside which has even stronger ties with the group, so this is actually a "merge" pattern. If my pattern-detection heuristics were a bit smarter, they would paint the outsider's links green.
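The "merge" pattern can be spotted mechanically: within one connected component, any language edition that contributes more than one article is a candidate, since interlanguage links should connect at most one article per language to a given subject. A hypothetical sketch (the function name and exact check are my own, not the tool's heuristics):

```python
from collections import defaultdict

def merge_candidates(component):
    """Flag language editions contributing more than one article
    to a single connected component; duplicates hint at either a
    needed merge or a wrong link pulling in a second article."""
    by_lang = defaultdict(list)
    for title in component:
        lang, _, _ = title.partition(":")   # "gl:Güiana francesa" -> "gl"
        by_lang[lang].append(title)
    return {lang: sorted(titles)
            for lang, titles in by_lang.items() if len(titles) > 1}

component = {
    "en:French Guiana", "no:Fransk Guyana",
    "gl:Güiana francesa", "gl:Güiana Francesa - Guyane française",
}
print(merge_candidates(component))          # flags the two gl: articles
```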
Hint: in "merge" patterns, the other article in the same language usually appears somewhere on the opposite side of the group. This is because node placement is simple mechanics, with a harmonic potential (trying to keep a fixed distance across each link) and a repulsive potential (trying to separate same-language articles as much as possible).
Addressing your remark from the previous mail: it is difficult to find a balance between the harmonic and repulsive potentials. When the harmonic potential is too strong, the different clusters sit too close together and the cluster labels overlap. When the repulsive potential is too strong, one observes the astronomical distances, as you've put it.
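The mechanics described above can be sketched as a toy force-directed layout: springs pull linked articles toward a fixed rest distance, and a repulsive term pushes apart articles from the same language edition. All constants and names here are made up for illustration; they are not the tool's actual implementation:

```python
import math
import random

def layout(nodes, edges, steps=200, rest=1.0, k_spring=0.1, k_repel=0.5):
    """Toy force-directed placement: harmonic term per link,
    repulsion only between same-language articles."""
    random.seed(0)                           # deterministic toy run
    pos = {n: [random.random(), random.random()] for n in nodes}
    lang = {n: n.split(":", 1)[0] for n in nodes}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        # Harmonic potential: spring toward the rest length on each link.
        for a, b in edges:
            dx, dy = pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]
            d = math.hypot(dx, dy) or 1e-9
            f = k_spring * (d - rest) / d
            force[a][0] += f * dx; force[a][1] += f * dy
            force[b][0] -= f * dx; force[b][1] -= f * dy
        # Repulsive potential: only between same-language articles.
        ns = list(nodes)
        for i, a in enumerate(ns):
            for b in ns[i + 1:]:
                if lang[a] != lang[b]:
                    continue
                dx, dy = pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]
                d2 = dx * dx + dy * dy or 1e-9
                f = min(k_repel / d2, 1.0)   # cap keeps the toy stable
                force[a][0] -= f * dx; force[a][1] -= f * dy
                force[b][0] += f * dx; force[b][1] += f * dy
        for n in nodes:
            pos[n][0] += force[n][0]
            pos[n][1] += force[n][1]
    return pos
```

With two same-language articles tied to the same cluster, the repulsion drives them toward opposite sides of it, which matches the hint about where to look for the duplicate; and one can see how an overly large repulsive constant would blow the whole picture up to "astronomical" scale.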
My hint here: after Ctrl-A, when you realize that the distances are too big, do the following:
1) go to the Positions dialog and press Seed;
2) open the dialog again and calculate positions with a lower (try 10 times lower) "Repulsive potential constant".
> Also, for performance, it would be useful to be able to hide all grey (normal) links, and just show the ones that need attention.
I couldn't agree more. This is what hurts performance the most. Ultimately, I'd like to replace the clusters of gray links with gray polygons (perhaps a slightly lighter gray). I should also change the layering so that the gray polygons are drawn behind the cluster labels, keeping the labels readable.
Thanks again for all your comments!
Regards, Bolo
This is excellent. We desperately need this on Wikisource and Wikibooks. You may have saved us a lot of trouble.
On Tue, Mar 18, 2008 at 12:46 AM, Lukasz Bolikowski bolo@icm.edu.pl wrote:
> I've written a visual tool for analyzing the graph of interlanguage links between all 256 editions of Wikipedia. [...]
White Cat wrote:
> This is excellent. We desperately need this on Wikisource and Wikibooks. You may have saved us a lot of trouble.
Thanks, I'm very happy to hear this! I'll add support for Wikisource and Wikibooks (but probably not before late April).
Regards, Bolo1729
PS. Both of my posts to this list appeared 2-3 days after sending. Is there a way to speed up the moderation?