Hi,
I'm doing a project with the content translation team as part of my FOSS OPW internship. Our goal is to understand what proportion of pages in a given wiki were translated from other Wikipedias.
As a first step, I need to create a list of articles in one language (e.g. HE) that have corresponding articles in another language, starting with English.
I wonder what is the best way to create this list. Possible approaches I have thought of:
1. Using the API Sandbox and an iterative script that calls the API.
2. Using the Wikimedia dumps (specifically the per-wiki interlanguage link records).
3. Using the Wikidata dumps (specifically wikidatawiki-latest-langlinks).
Am I missing something? Which is the best way to build the list, especially taking into account the possibility of inline interlanguage links?
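For approach 1, a minimal sketch of the processing step: query the MediaWiki API of he.wikipedia.org with action=query, generator=allpages, prop=langlinks and lllang=en, then keep the pages that report an English language link. To stay self-contained, the code below only parses a hypothetical response fragment (the response shape follows the API as I understand it; verify it in the API Sandbox before relying on it):

```python
def titles_with_en_link(api_response):
    """Return titles from one API result batch that have an EN langlink."""
    pages = api_response.get("query", {}).get("pages", {})
    return sorted(
        page["title"]
        for page in pages.values()
        if any(ll.get("lang") == "en" for ll in page.get("langlinks", []))
    )

# Hypothetical response fragment, in the shape the API returns:
sample = {
    "query": {
        "pages": {
            "1": {"title": "ירושלים",
                  "langlinks": [{"lang": "en", "*": "Jerusalem"}]},
            "2": {"title": "דף ללא קישור"},  # no langlinks at all
        }
    }
}

print(titles_with_en_link(sample))  # only the page with an EN link
```

A real script would loop over the API's continuation parameter and feed each batch through this function.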
Thanks for your help, Neta
Neta Livneh, 18/12/2014 13:35:
As a first step, I need to create a list of articles in one language (e.g. HE) that have corresponding articles in another language, starting with English.
Sounds like https://tools.wmflabs.org/not-in-the-other-language/ . I recommend that you send patches for the existing tool.
Nemo
Hi Nemo,
I think this tool actually does the opposite of what I need, since I want all the pages that exist both in HE and in EN. Obviously, I could take all the articles in Hebrew and subtract the tool's results, but I wonder if that is the best way to go.
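The subtraction fallback described above is just a set difference: if the tool yields the HE articles with no EN counterpart, the complement against all HE articles gives the pages that exist in both languages. A tiny sketch with placeholder titles:

```python
# Placeholder data; in practice these sets would come from the full HE
# article list and from the not-in-the-other-language tool's output.
all_he_articles = {"ירושלים", "תל אביב", "דף מקומי"}
missing_in_en = {"דף מקומי"}  # hypothetical tool output

in_both = all_he_articles - missing_in_en
print(sorted(in_both))
```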
Thanks, Neta
On Thu, Dec 18, 2014 at 5:09 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Neta Livneh, 18/12/2014 13:35:
As a first step, I need to create a list of articles in one language (e.g. HE) that have corresponding articles in another language, starting with English.
Sounds like https://tools.wmflabs.org/not-in-the-other-language/ . I recommend that you send patches for the existing tool.
Nemo
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Neta Livneh, 18/12/2014 19:04:
I think this tool actually does the opposite of what I need, since I want all the pages that exist both in HE and in EN. Obviously, I could take all the articles in Hebrew and subtract the tool's results, but I wonder if that is the best way to go.
Ok. You can still reuse its code/approach.
Nemo
Good idea, I will have a look at the code.
Does that mean there are not many inline interlanguage links (that is, links whose information is not in Wikidata, or contradicts Wikidata), so it is OK (and obviously easier) to use the language-link data from Wikidata?
Thanks!
On Thu, Dec 18, 2014 at 8:19 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Neta Livneh, 18/12/2014 19:04:
I think this tool actually does the opposite of what I need, since I want all the pages that exist both in HE and in EN. Obviously, I could take all the articles in Hebrew and subtract the tool's results, but I wonder if that is the best way to go.
Ok. You can still reuse its code/approach.
Nemo
Neta Livneh, 18/12/2014 19:31:
Does that mean there are not many inline interlanguage links (that is, links whose information is not in Wikidata, or contradicts Wikidata), so it is OK (and obviously easier) to use the language-link data from Wikidata?
The amount of "local" interwikis is negligible: perhaps a couple of million, out of the 100–200 million there used to be. https://stats.wikimedia.org/EN/TablesDatabaseWikiLinks.htm
Nemo
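For the dump-based approaches mentioned earlier in the thread, a sketch of scanning a langlinks SQL dump for rows whose ll_lang is 'en'. The tuple layout (ll_from, ll_lang, ll_title) follows the langlinks table schema; the sample INSERT line is a made-up fragment for illustration:

```python
import re

# One langlinks row inside an INSERT statement: (ll_from,'ll_lang','ll_title')
ROW = re.compile(r"\((\d+),'([^']*)','((?:[^'\\]|\\.)*)'\)")

def en_links(sql_line):
    """Yield (page_id, en_title) pairs found on one dump INSERT line."""
    for page_id, lang, title in ROW.findall(sql_line):
        if lang == "en":
            yield int(page_id), title

# Hypothetical fragment of a hewiki langlinks dump:
sample = ("INSERT INTO `langlinks` VALUES "
          "(1,'en','Jerusalem'),(1,'fr','Jérusalem'),(2,'de','Berlin');")
print(list(en_links(sample)))  # [(1, 'Jerusalem')]
```

A real run would stream the gzipped dump file line by line through this function instead of a hardcoded string.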