Hi,
Amir Aharoni and I thought that this might be interesting for people here.
We wanted to answer the following question: for each language, how many of the articles in the main namespace that appear in one Wikipedia (e.g., FR) also appear in another (e.g., EN)? We calculated this as the number of articles that exist in both languages, expressed as a percentage of the total number of articles in one of the languages (taken from [1]). That is, we calculated intersection(EN, FR)/Count(FR). We did this for all pairs of languages (287^2) [2].
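The ratio described above can be sketched in Python roughly as follows. This is only an illustration: the function name `coexistence_share`, the dictionary layout and the numbers in the example are assumptions made up for this sketch, not the real counts or the code that was actually used.

```python
def coexistence_share(shared_counts, article_counts):
    """Return {(a, b): share of a's articles that also exist in b}.

    shared_counts  -- hypothetical dict {(a, b): articles in a linked to b}
    article_counts -- hypothetical dict {wiki: total number of articles}
    """
    shares = {}
    for (a, b), shared in shared_counts.items():
        total = article_counts.get(a, 0)
        if total > 0:  # skip wikis without a known article count
            shares[(a, b)] = shared / total
    return shares


# Made-up numbers, only to show the asymmetry caused by the denominator.
example = coexistence_share(
    {("fr", "en"): 800, ("en", "fr"): 800},
    {"fr": 1000, "en": 4000},
)
```

With these made-up numbers the same 800 shared articles give a share of 0.8 for (fr, en) but only 0.2 for (en, fr), because each count is divided by the size of its own source wiki.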
Results:
1. The co-existence matrix of counts can be found in this Google Spreadsheet: https://docs.google.com/spreadsheets/d/1wj3fPkU8v2-KcEjTNtFLabMTXWyhgvRhgywhPjnRLNY/edit?usp=sharing
- It was generated on 01/09/2015 using the langlinks table of every wiki. The underlying query is based on this code (%s is the wiki code):
-- Count main-namespace langlinks per target language for one wiki.
SELECT '%s' AS source, ll_lang AS target, COUNT(*) AS count
FROM %s_p.langlinks
LEFT JOIN %s_p.page ON page_id = ll_from
WHERE page_namespace = 0
GROUP BY ll_lang;
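As a rough illustration of how the %s placeholders in the query above could be filled in for each wiki, here is a minimal Python sketch. The names `QUERY_TEMPLATE` and `build_query`, the validation rule and the example codes "enwiki" and "frwiki" are assumptions for this sketch; the script that was actually used is not shown in this thread.

```python
import re

QUERY_TEMPLATE = (
    "SELECT '%s' AS source, ll_lang AS target, COUNT(*) AS count "
    "FROM %s_p.langlinks LEFT JOIN %s_p.page ON page_id = ll_from "
    "WHERE page_namespace = 0 GROUP BY ll_lang;"
)

# Assumption: wiki codes consist only of lowercase letters, digits and underscores.
VALID_CODE = re.compile(r"[a-z0-9_]+")


def build_query(wiki_code):
    """Fill the three %s placeholders with the same wiki code."""
    if not VALID_CODE.fullmatch(wiki_code):
        raise ValueError("unexpected wiki code: %r" % (wiki_code,))
    return QUERY_TEMPLATE % (wiki_code, wiki_code, wiki_code)


# Hypothetical codes, used here only as examples.
queries = [build_query(code) for code in ("enwiki", "frwiki")]
```

The code is validated before it is pasted into the SQL text because database names cannot be passed as ordinary bound parameters.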
- The links are not perfectly symmetrical, but they are close: on average there is less than one percent difference between the links from language A to B and the links from language B to A.
- However, the symmetry is not equally good everywhere. Wikis with fewer than 3500 links (which means they have fewer than 100 articles) have on average over 20% more out links (that is, links taken from that language's langlinks table) than in links (other wikis pointing at that language).
- As the number of langlinks gets bigger (and, in most cases, the size of the wiki), the difference and the variance between the in links and out links get smaller.
- Some out links pointed to mistaken targets (zh-cn, zh-tw, nn); this is fixed.
- The raw data can be sent on request.
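The comparison of in links and out links described above could be computed along these lines. This is a sketch under assumptions: the function name `in_out_difference`, the input layout and the wiki labels and counts in the example are made up, and the relative difference is defined here as (out - in) / in, which may not be exactly the measure behind the percentages quoted above.

```python
from collections import defaultdict


def in_out_difference(pair_counts):
    """Return {wiki: (out_links, in_links, relative_difference)}.

    pair_counts -- hypothetical dict {(source, target): count}, one entry
                   per row of the per-wiki query results.
    relative_difference is (out - in) / in, or None when in_links is zero.
    """
    out_links = defaultdict(int)
    in_links = defaultdict(int)
    for (source, target), count in pair_counts.items():
        out_links[source] += count  # from the source's own langlinks table
        in_links[target] += count   # other wikis pointing at the target

    result = {}
    for wiki in set(out_links) | set(in_links):
        o, i = out_links[wiki], in_links[wiki]
        result[wiki] = (o, i, (o - i) / i if i else None)
    return result


# Made-up counts for three hypothetical wikis.
stats = in_out_difference({
    ("wiki_a", "wiki_b"): 120, ("wiki_b", "wiki_a"): 100,
    ("wiki_a", "wiki_c"): 30, ("wiki_c", "wiki_a"): 30,
})
```

In this toy input wiki_a has 150 out links and 130 in links, so it shows a positive difference, while wiki_c is perfectly symmetrical.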
2. A heat map of the co-existence of wikis with more than 50,000 articles. It is ordered by size. As I mentioned, the triangles above and below the diagonal are not symmetrical because the counts (which are themselves not equal but are close enough) are divided by the number of articles in each wiki. The heat map ranges from red (high level of congruence) to yellow (low level).
[image: Inline image 1]
Points to notice:
1. Most languages have strong connections with English.
2. There is a group of interconnected wikis that are based on Swedish (Dutch, Waray-Waray, Cebuano, Vietnamese, Indonesian, Minangkabau).
3. Piedmontese is highly interconnected with Latin languages, as is Latin itself. On the other hand, Chechen is mostly connected to Russian.
4. Arabic has 8% more in links than out links. There isn't one wiki that caused this difference, so it's not a bot.
5. Telugu doesn't have many interlinks, not even to English, Hindi or Bengali.
6. There are other visible strong connections (such as Serbian and Serbo-Croatian) but they are not as surprising.
Thoughts?
Cheers, Neta
[1] meta.wikimedia.org/wiki/List_of_Wikipedias, updated on 01/12/2015.
[2] You might be wondering why we calculated both EN -> FR and FR -> EN, given that there is a 1-to-1 connection between the interlanguage links in Wikidata. We used the data from the langlinks table of every Wikipedia and not from the wiki interlanguage link table. We did so for two reasons: 1) it was computationally easier; 2) we wanted to see if there are any irregularities in the data.
Nice, is there a higher resolution version of the image? I'm having difficulties reading it.
Neta Livneh, 18/01/2015 18:53:
- There is a group of interconnected wikis that are based on Swedish
(Dutch, Waray-Waray, Cebuano, Vietnamese, Indonesian, Minangkabau).
Looks like a list of Lsjbot friends.
- Telugu doesn't have many interlinks, not to English, Hindi or Bengali.
Wasn't te.wiki one of the wikis involved in the Google Translator Toolkit experiments? "Normal" translators tend to add interlinks, but I have no idea about that kind of translator.
Nemo
2015-01-18 19:06 GMT+01:00 Federico Leva (Nemo) nemowiki@gmail.com:
Nice, is there a higher resolution version of the image? I'm having difficulties reading it.
Neta Livneh, 18/01/2015 18:53:
- There is a group of interconnected wikis that are based on Swedish
(Dutch, Waray-Waray, Cebuano, Vietnamese, Indonesian, Minangkabau).
Looks like a list of Lsjbot friends.
Or to be more specific, languages that have articles about a lot of species (not all are LsjBot).
/Jan
I think this is a better version.
Neta
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Neta Livneh, 18/01/2015 19:57:
I think this is a better version.
Thanks. I think the way to read this graph is that it's naturally darker below the diagonal line, and lighter above it. In fact, position (x, y) is the percentage of articles in wiki x which also exist in wiki y. If y > x we can't reach 100 %; for y >> x, we approach zero. So, the things worth noting are mostly the dark areas above the line and white areas below the line.
Well known botpedias (ceb and war) clearly stand out. To a lesser extent also nl, sv. If you ordered the wikis by pageviews (as per www.wikipedia.org top 10) the shade would look more natural (but we'd lose information, unless you redefined the colouring).
A non-mystery is the strong correlation between sh and sr: that's basically the same language and they have a similar size. A weird thing is the status of "min": you'd expect it to have some stronger correlation to zh; I'd call that a gap to fill. The horizontal lines for ja, vi also stand out: we rarely see users from those wikis, they're more isolated. The vertical lines above (uz, vo) often come with surprises: probably some common bulk of bot-created articles. The dark spots in the vertical line above pms are an anthology of secessionist/regional/nostalgic languages; not a surprise given the interests of the core editors.
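The ceiling mentioned above (a cell cannot reach 100 % when wiki y is smaller than wiki x) can be made explicit with a small Python sketch. It assumes that wikis are indexed by decreasing article count, so that y > x means wiki y is the smaller one, and that interlanguage links are one-to-one; the function names and the sizes are made up for illustration.

```python
def max_share(size_x, size_y):
    """Upper bound on the share of wiki x's articles that can also exist
    in wiki y, assuming interlanguage links are one-to-one."""
    if size_x <= 0:
        raise ValueError("size_x must be positive")
    return min(1.0, size_y / size_x)


def fill_ratio(observed_share, size_x, size_y):
    """Observed share relative to its ceiling; a value near 1.0 means
    the cell is about as dark as it can possibly get."""
    ceiling = max_share(size_x, size_y)
    return observed_share / ceiling if ceiling > 0 else 0.0


# Made-up sizes: wiki x has 1,000,000 articles, wiki y has 250,000.
ceiling = max_share(1_000_000, 250_000)
ratio = fill_ratio(0.2, 1_000_000, 250_000)
```

With these made-up sizes the ceiling is 0.25, so an observed share of 0.2 already fills 80 % of what is possible. Colouring by such a ratio would be one possible way of redefining the colouring, not something that was done for the heat map in this thread.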
Nemo
Hi Nemo,
Thanks for the comments and input!
- I agree with how you read the graph: the triangle below the diagonal is more interesting than the one above it, as it contains more information, except for languages that are darker in the upper triangle.
- I was surprised by the clustering (Swedish, Dutch, Waray-Waray, Cebuano, Vietnamese, Indonesian, Minangkabau), but if it is a bot that created it, then it makes sense.
- I can try ordering the Wikipedias by page views; it might put the emphasis on the real activity and not only on the size (or bot-generated links/pages). Actually, I can change the y-axis to be ordered by page views instead of articles so we won't lose information.
- Another point I didn't mention is that there are small languages (not appearing in the heat map) with a disproportionate number of linked pages compared to the number of articles. This is due to (I think) bot-generated articles that don't have interlinks in the text, so they are not counted as articles.
Best, Neta