Re: [Xmldatadumps-l] List of all words of a wiktionary

9 Jan 2012


Among the files produced for each dump of each project is a list of all
titles in namespace 0.  This would include all lemmas for a wiktionary.
See, for example,
http://dumps.wikimedia.org/ruwiktionary/20120107/ruwiktionary-20120107-all-t...
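
A minimal sketch of reading that list once it has been downloaded and decompressed. It assumes the file holds one title per line, that an optional first line `page_title` is a column header, and that underscores stand for spaces as in MediaWiki database titles; `list_titles` is a hypothetical helper name.

```sh
# Read a decompressed title list on stdin and print one title per line.
# Assumptions: an optional first line "page_title" is a column header,
# and underscores in titles stand for spaces.
list_titles() {
    awk '
    NR == 1 && $0 == "page_title" { next }   # skip the header line, if any
    { gsub(/_/, " "); print }                # database form -> display form
    '
}
```

With the downloaded file saved locally as, say, `titles.gz` (a placeholder name), this could be run as `gzip -dc titles.gz | list_titles > words.txt`. The result covers every lemma in namespace 0, whatever its language.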
Ariel
On Mon, 09-01-2012 at 11:21 +0100, Sébastien Druon wrote:
...
Thank you all for the tips.
Do you also have a good tip for getting the parsed (HTML) version of
these entries in an easy way?
Thanks again
Sebastien
On 9 January 2012 10:51, Jérémie Roquet arkanosis@gmail.com wrote:
    Hi Sebastien,

    2012/1/9 Sébastien Druon <druon.sebastien@gmail.com>:
    > How is it possible to get the list of all the entries (words) of a
    > wiktionary?
    > For example, for the russian wiktionary, I want to get the list of
    > all the russian entries (no other languages)


    You can download the latest dump¹ and pass it through a simple awk
    script that prints the titles of the pages that contain the {{-ru-}}
    template, i.e. something like:

    ----8<----

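    # Print the title of every page whose text contains {{-ru-}}.
    # Assumes each <title>...</title> element sits on a single line.

    BEGIN {
     ru = 0
    }

    # A new page starts: flush the previous page, then store the new title.
    /<title>.*<\/title>/ {
     if (ru) {
       print title
     }

     # Strip the tags instead of relying on fixed column offsets.
     title = $0
     sub(/^.*<title>/, "", title)
     sub(/<\/title>.*$/, "", title)
     ru = 0
    }

    # Plain substring test, so the braces are never read as a regex interval.
    index(tolower($0), "{{-ru-}}") {
     ru = 1
    }

    # Flush the last page of the dump.
    END {
     if (ru) {
       print title
     }
    }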

    ----8<----

    You'd still have to filter the output to only keep titles in
    the main namespace.
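
One way to do that filtering, as a rough sketch: drop every title that starts with a known namespace prefix followed by a colon. The prefix list is an assumption that has to be supplied by hand, for example from the `<namespaces>` block in the `<siteinfo>` header of the dump; `filter_main_ns` is a hypothetical helper name.

```sh
# Keep only titles that do not start with a known namespace prefix.
# Usage: filter_main_ns "Prefix1|Prefix2|..." < titles
# The prefixes are given without the trailing colon, separated by "|".
filter_main_ns() {
    awk -v prefixes="$1" '
    BEGIN { n = split(prefixes, p, "|") }
    {
        for (i = 1; i <= n; i++) {
            # index() == 1 means the title begins with "Prefix:"
            if (index($0, p[i] ":") == 1) next
        }
        print
    }'
}
```

With the script above saved as `ru.awk` (a placeholder name), the whole pipeline might look like `bzcat ruwiktionary-20120107-pages-articles.xml.bz2 | awk -f ru.awk | filter_main_ns 'Шаблон|Категория'`. Matching on full prefixes rather than on any colon keeps main-namespace titles that happen to contain a colon themselves.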

    It should be possible using categories too, but this wouldn't be any
    easier or more reliable, and it would be much slower.

    Best regards,

    ¹http://dumps.wikimedia.org/ruwiktionary/20120107/ruwiktionary-20120107-pages-articles.xml.bz2

    --
    Jérémie


Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
