Hello.
I am doing some conversions for aarddict (https://aarddict.org/), an offline Wikipedia and Wiktionary app. I use mw2slob and the N0 files found at https://dumps.wikimedia.org/other/enterprise_html/runs/ for these conversions.
But in the Spanish Wikipedia, for example, the article https://es.wikipedia.org/wiki/Anexo:Aves_de_Canarias does not seem to be part of the tar.gz file.
And in the French Wiktionary, the article https://fr.wiktionary.org/wiki/Conjugaison:espagnol/aumentar is also missing from the respective tar.gz file.
Can they be found somewhere else? In N6 or N14? It seems to me that articles/pages with a colon prefix like Anexo: or Conjugaison: are not included. But why? And where could I find them? Are they too big, or what is the reasoning behind not including them?
Regards, Erik
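For reference, here is a minimal sketch (in Python) of how one might verify locally whether a given title is present in one of these enterprise HTML tar.gz dumps. It assumes the archive contains NDJSON files in which each line is a JSON object with a "name" field holding the page title; if the dump format differs, adjust the field name accordingly.

    import json
    import sys
    import tarfile

    def title_in_dump(dump_path, wanted_title):
        """Stream an enterprise HTML dump (a tar.gz of NDJSON files) and
        report whether a page with the given title appears in it.
        Assumes each NDJSON line is a JSON object with a "name" field."""
        with tarfile.open(dump_path, "r:gz") as tar:
            for member in tar:
                if not member.isfile():
                    continue
                fh = tar.extractfile(member)
                for line in fh:
                    page = json.loads(line)
                    if page.get("name") == wanted_title:
                        return True
        return False

    if __name__ == "__main__":
        # e.g. python check_dump.py dump.json.tar.gz "Anexo:Aves de Canarias"
        print(title_in_dump(sys.argv[1], sys.argv[2]))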
That eswiki page is in namespace "wgNamespaceNumber":104; the FR page is "wgNamespaceNumber":116.
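Those namespace numbers can be cross-checked with the MediaWiki Action API, which reports each page's namespace as "ns" in query results. A small sketch, assuming the Python requests library is installed:

    import requests

    def page_namespace(api_url, title):
        """Ask the MediaWiki Action API which namespace a title belongs to."""
        params = {
            "action": "query",
            "titles": title,
            "format": "json",
            "formatversion": "2",
        }
        data = requests.get(api_url, params=params).json()
        return data["query"]["pages"][0]["ns"]

    # Expected: 104 (Anexo) on eswiki and 116 (Conjugaison) on frwiktionary
    print(page_namespace("https://es.wikipedia.org/w/api.php", "Anexo:Aves de Canarias"))
    print(page_namespace("https://fr.wiktionary.org/w/api.php", "Conjugaison:espagnol/aumentar"))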
On 13/02/22 21:16, Erik del Toro wrote:
Can they be found somewhere else? In N6 or N14? It seems to me that articles/pages with a colon prefix like Anexo: or Conjugaison: are not included.
These are not namespace 0. Perhaps the export process forgot to respect $wgContentNamespaces?
Federico
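To see which namespaces a wiki actually treats as content namespaces (the effect of $wgContentNamespaces), one can query meta=siteinfo with siprop=namespaces; content namespaces carry a "content" flag in the response. A rough sketch, again assuming the Python requests library:

    import requests

    def content_namespaces(api_url):
        """Return the namespace IDs a wiki marks as content namespaces,
        i.e. the effective $wgContentNamespaces, via meta=siteinfo."""
        params = {
            "action": "query",
            "meta": "siteinfo",
            "siprop": "namespaces",
            "format": "json",
            "formatversion": "2",
        }
        data = requests.get(api_url, params=params).json()
        namespaces = data["query"]["namespaces"].values()
        return sorted(ns["id"] for ns in namespaces if ns.get("content"))

    # Compare the two wikis discussed in this thread
    print(content_namespaces("https://es.wikipedia.org/w/api.php"))
    print(content_namespaces("https://fr.wiktionary.org/w/api.php"))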
I don't think these namespaces are included in $wgContentNamespaces on the Wiktionaries.
I've created a Phabricator ticket to request that more namespaces be included in the dump; I'm not sure if this is the correct process/project tag:
https://phabricator.wikimedia.org/T303652
–Jan
Just wanted to tell you that http://aarddict.org users and dictionary creators have also stumbled over these missing namespaces and are now suggesting to continue scraping them. So is scraping the expected approach? See here: https://groups.google.com/g/aarddict/c/WssxfWQYsto
Regards, Erik
On 18/03/22 14:04, Erik del Toro wrote:
Just wanted to tell you that http://aarddict.org users and dictionary creators have also stumbled over these missing namespaces and are now suggesting to continue scraping them. So is scraping the expected approach?
Thanks for mentioning this. I'm not sure what you mean by scraping here exactly: if you mean parsing the wikitext, definitely not; if you mean getting the already-parsed HTML from the REST API, that's acceptable. https://www.mediawiki.org/wiki/API:REST_API/Reference#Get_HTML
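As an illustration of that endpoint, a minimal sketch that fetches the already-parsed HTML of one of the missing pages via GET /w/rest.php/v1/page/{title}/html (the path comes from the reference above; the Python requests library is assumed, and for bulk fetching a descriptive User-Agent plus request throttling would be advisable):

    import urllib.parse
    import requests

    def fetch_html(site, title):
        """Fetch the already-parsed HTML of a page via the MediaWiki REST API
        (GET /w/rest.php/v1/page/{title}/html)."""
        encoded = urllib.parse.quote(title, safe="")  # encode ':' and '/' in the title
        url = f"https://{site}/w/rest.php/v1/page/{encoded}/html"
        # the User-Agent value below is only a placeholder
        resp = requests.get(url, headers={"User-Agent": "example-dump-check/0.1"})
        resp.raise_for_status()
        return resp.text

    html = fetch_html("fr.wiktionary.org", "Conjugaison:espagnol/aumentar")
    print(len(html))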
As for HTML dumps, the ZIM files by Kiwix for the French Wiktionary include pages like "Conjugaison:espagnol/aumentar", so that's another possible avenue for bulk imports. I've checked the latest version: https://download.kiwix.org/zim/wiktionary/wiktionary_fr_all_nopic_2022-01.zi...
Federico