Fwd: Calculating interlinks between Wikipedias - Analytics

18 Jan 2015


Hi,
Amir Aharoni and I thought that this might be interesting for people here.
We wanted to answer the following question: for each language, how many of
the articles in the main namespace that appear in one Wikipedia (e.g., FR)
also appear in another (e.g., EN)? We calculated this as the number of
articles that exist in both languages, as a percentage of the total number
of articles in one of the languages (taken from [1]). That is, we
calculated intersection(EN, FR)/Count(FR). We did this for all pairs of
languages (287^2) [2].
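To make the calculation concrete, here is a minimal Python sketch. The function name `coverage`, the dictionary layout and all numbers are hypothetical and only for illustration; this is not the code or the data we actually used.

```python
def coverage(shared_counts, article_counts):
    """Return {(a, b): percent of b's articles that also exist in a}.

    shared_counts:  {(a, b): number of articles linked between a and b}
    article_counts: {lang: total number of main-namespace articles}
    """
    result = {}
    for (a, b), shared in shared_counts.items():
        # intersection(a, b) / Count(b), expressed as a percentage
        result[(a, b)] = 100.0 * shared / article_counts[b]
    return result


# Hypothetical numbers, for illustration only.
article_counts = {"en": 1000, "fr": 400}
shared_counts = {("en", "fr"): 300, ("fr", "en"): 300}

percentages = coverage(shared_counts, article_counts)
print(percentages[("en", "fr")])  # intersection(EN, FR) / Count(FR)
print(percentages[("fr", "en")])  # intersection(FR, EN) / Count(EN)
```

Even when the two counts are equal, the two percentages differ, because each one is divided by the article count of a different wiki.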
Results:
1. The co-exist matrix of counts can be found in this Google Spreadsheet:
https://docs.google.com/spreadsheets/d/1wj3fPkU8v2-KcEjTNtFLabMTXWyhgvRhgywhPjnRLNY/edit?usp=sharing
- It was generated on 01/09/2015 using the langlinks table of every wiki.
The underlying query is based on this code (%s is the wiki code):
SELECT '%s' AS source, ll_lang AS target, COUNT(*) AS count
FROM %s_p.langlinks
LEFT JOIN %s_p.page ON page_id = ll_from
WHERE page_namespace = 0
GROUP BY ll_lang;
- The links are not exactly symmetrical, but they are close: on average there
is less than one percent difference between the links from lang A to B and
the links from lang B to A.
- However, the match isn't perfect. Wikis with fewer than 3,500 links (which
means they have fewer than 100 articles) have on average over 20% more out
links (that is, links taken from that language's langlinks table) than in
links (other wikis pointing at that language).
- As the number of langlinks gets bigger (and in most cases, the size of
the wiki), the difference and variance between the in and out links get
smaller.
- Some out links pointed to mistaken targets (zh-cn, zh-tw, nn). This is now fixed.
- The raw data can be sent on request.
2. A heat map of the co-existence between wikis with more than 50,000
articles. It is ordered by size. As I mentioned, the triangles above and
below the diagonal are not symmetrical, because the counts (which are
themselves not equal but are close enough) are divided by the number of
articles in each wiki. The heat map colors range from red (high level of
congruence) to yellow (low level).
[image: Inline image 1]
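The comparison of in links and out links described under result 1 can be sketched roughly as follows. This is a minimal Python illustration, assuming rows shaped like the query output (source, target, count). The function names, the wiki codes "aa", "bb", "cc" and all counts are made up, and the sketch assumes every wiki has at least one in link.

```python
from collections import defaultdict


def in_out_totals(rows):
    """Sum out links and in links per wiki from (source, target, count) rows."""
    out_links = defaultdict(int)
    in_links = defaultdict(int)
    for source, target, count in rows:
        out_links[source] += count  # from the source wiki's own langlinks table
        in_links[target] += count   # other wikis pointing at the target
    return out_links, in_links


def out_surplus_percent(out_links, in_links, wiki):
    """Percent by which a wiki's out links exceed its in links."""
    return 100.0 * (out_links[wiki] - in_links[wiki]) / in_links[wiki]


# Hypothetical rows, for illustration only.
rows = [
    ("aa", "bb", 120),
    ("bb", "aa", 100),
    ("aa", "cc", 60),
    ("cc", "aa", 50),
    ("bb", "cc", 40),
    ("cc", "bb", 40),
]
out_links, in_links = in_out_totals(rows)
print(out_surplus_percent(out_links, in_links, "aa"))
```

A positive value means the wiki has more out links than in links; a negative value (as in the Arabic case below) means more in links than out links.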
Points to notice:
1. Most languages have strong connections with English.
2. There is a group of interconnected wikis that are based on Swedish
(Dutch, Waray-Waray, Cebuano, Vietnamese, Indonesian, Minangkabau).
3. Piedmontese is highly interconnected with the Latin languages, as is Latin
itself. On the other hand, Chechen is mostly connected to Russian.
4. Arabic has 8% more in links than out links. There isn't one wiki that
caused this difference, so it's not a bot.
5. Telugu doesn't have many interlinks, not even to English, Hindi or Bengali.
6. There are other visible strong connections (such as Serbian and
Serbo-Croatian), but they are not as surprising.
Thoughts?
Cheers,
Neta
[1] meta.wikimedia.org/wiki/List_of_Wikipedias updated on 01/12/2015.
[2] You might be wondering why we calculated both EN -> FR and FR -> EN,
as there is a 1-to-1 connection between the interlanguage links in Wikidata.
We used the data from the langlinks table of every Wikipedia and not from
the Wikidata interlanguage link table. We did so for two reasons: 1) it
was computationally easier, and 2) we wanted to see if there are any
irregularities in the data.