Hi,
Amir Aharoni and I thought that this might be interesting for people here.
We wanted to answer the following question: for each pair of languages, how many of the main-namespace articles that appear in one Wikipedia (e.g., FR) also appear in another (e.g., EN)? We calculated this as the number of articles that exist in both languages, expressed as a percentage of the total number of articles in one of the languages (taken from [1]). That is, we calculated intersection(EN, FR)/Count(FR). We did this for all pairs of languages (287^2) [2].
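To make the ratio concrete, here is a minimal Python sketch of the calculation. It is an illustration only, not the code behind the spreadsheet: the function name `coverage_percent`, the dictionary layout, and the toy numbers are all assumptions made up for this example.

```python
def coverage_percent(shared, totals):
    """Return the percentage of each source wiki's articles that also exist in the target wiki.

    shared[(source, target)]: number of source-wiki articles with a langlink to target
                              (hypothetical layout, chosen for this sketch).
    totals[source]:           total number of articles in the source wiki.
    """
    result = {}
    for (source, target), count in shared.items():
        # intersection(source, target) / Count(source), expressed as a percentage
        result[(source, target)] = 100.0 * count / totals[source]
    return result


# Toy numbers, invented for illustration; they are not real Wikipedia counts.
shared = {("fr", "en"): 800, ("en", "fr"): 810}
totals = {"fr": 1000, "en": 3000}

coverage = coverage_percent(shared, totals)
for pair, pct in sorted(coverage.items()):
    print(pair, round(pct, 1))
```

Because each count is divided by the size of its own source wiki, the two directions of a pair can give very different percentages even when the raw counts are almost equal.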
Results:
1. The co-exist matrix of counts can be found in this Google Spreadsheet: https://docs.google.com/spreadsheets/d/1wj3fPkU8v2-KcEjTNtFLabMTXWyhgvRhgywhPjnRLNY/edit?usp=sharing
- It was generated on 01/09/2015 using the langlinks table of every wiki. The underlying query is based on this code (%s is the wiki code):
SELECT '%s' as source, ll_lang as target, COUNT(*) as count FROM %s_p.langlinks LEFT JOIN %s_p.page
ON page_id = ll_from WHERE page_namespace = 0 GROUP BY ll_lang;
- The links are not exactly symmetrical, although on average there is less than one percent difference between the links from language A to B and the links from language B to A.
- However, it wasn't perfect. Wikis with fewer than 3500 links (which means the wiki has fewer than 100 articles) have on average over 20% more out links (that is, links taken from that language's langlinks table) than in links (other wikis pointing at that language).
- As the number of langlinks gets bigger (and, in most cases, the size of the wiki), the difference and variance between the in and out links get smaller.
- Some out links pointed to mistaken targets (zh-cn, zh-tw, nn); this is fixed.
- The raw data can be sent on request.
2. A heat map of the co-exist values for wikis with more than 50,000 articles. It is ordered by size. As I mentioned, the two triangles of the matrix are not symmetrical, because the counts (which are themselves not equal but are close enough) are divided by the number of articles in each wiki. The heat map ranges from red (high level of congruence) to yellow (low level).
[image: Inline image 1]
Points to notice:
   1. Most languages have strong connections with English.
   2. There is a group of interconnected wikis that are based on Swedish (Dutch, Waray-Waray, Cebuano, Vietnamese, Indonesian, Minangkabau).
   3. Piedmontese is highly interconnected with Latin languages, as is Latin itself. On the other hand, Chechen is mostly connected to Russian.
   4. Arabic has 8% more in links than out links. There isn't one wiki that caused this difference, so it's not a bot.
   5. Telugu doesn't have many interlinks, not even to English, Hindi or Bengali.
   6. There are other visible strong connections (such as Serbian and Serbo-Croatian), but they are not as surprising.
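For anyone who wants to repeat the in/out comparison from result 1 on the count matrix, a rough Python sketch follows. It assumes the counts are held in a dictionary keyed by (source, target); the function names, the wiki codes `wiki_a`, `wiki_b`, `wiki_c`, and the numbers are hypothetical and not taken from the spreadsheet.

```python
from collections import defaultdict


def in_out_totals(counts):
    """Sum langlink counts per wiki in both directions.

    counts[(source, target)]: number of langlinks from source to target
                              (hypothetical layout, chosen for this sketch).
    """
    out_links = defaultdict(int)  # links listed in a wiki's own langlinks table
    in_links = defaultdict(int)   # links from other wikis pointing at that wiki
    for (source, target), n in counts.items():
        out_links[source] += n
        in_links[target] += n
    return out_links, in_links


def out_surplus_percent(out_links, in_links, wiki):
    """Percentage by which out links exceed in links (negative if there are fewer).

    Assumes the wiki has at least one in link; otherwise this divides by zero.
    """
    return 100.0 * (out_links[wiki] - in_links[wiki]) / in_links[wiki]


# Invented toy counts, for illustration only.
counts = {
    ("wiki_a", "wiki_b"): 120, ("wiki_b", "wiki_a"): 100,
    ("wiki_a", "wiki_c"): 60,  ("wiki_c", "wiki_a"): 50,
    ("wiki_b", "wiki_c"): 40,  ("wiki_c", "wiki_b"): 40,
}

out_links, in_links = in_out_totals(counts)
for wiki in sorted(out_links):
    print(wiki, out_surplus_percent(out_links, in_links, wiki))
```

A positive value means the wiki points at other languages more often than they point back, which is the pattern described above for the very small wikis.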
Thoughts?
Cheers, Neta
[1] meta.wikimedia.org/wiki/List_of_Wikipedias updated on 01/12/2015.
[2] You might be wondering why we calculated both EN -> FR and FR -> EN, as there is a 1 to 1 connection between the interlanguage links in Wikidata. We used the data from the langlinks table of every Wikipedia and not from the wiki interlanguage link table. We did so for two reasons: 1) it was computationally easier; 2) we wanted to see if there are any irregularities in the data.