On 9 January 2012 10:51, Jérémie Roquet <arkanosis@gmail.com> wrote:

Hi Sebastien,

2012/1/9 Sébastien Druon <druon.sebastien@gmail.com>:

> How is it possible to get the list of all the entries (words) of a
> Wiktionary?
> For example, for the Russian Wiktionary, I want to get the list of all the
> Russian entries (no other languages).

You can download the latest dump¹ and pass it through a simple awk
script that prints the titles of the pages that contain the {{-ru-}}
template, i.e. something like:

----8<----

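BEGIN {
    ru = 0    # set to 1 once the current page contains {{-ru-}}
}

# A new <title> line starts the next page: flush the previous one first.
/<title>/ {
    if (ru) {
        print title
    }

    # Assumes the dump's layout: "    <title>" (11 characters) before
    # the title and "</title>" (8 characters) after it.
    title = substr($0, 12, length($0) - 19)
    ru = 0
}

# Plain substring test, so the braces are never parsed as a regex interval.
index(tolower($0), "{{-ru-}}") {
    ru = 1
}

# Flush the last page of the dump.
END {
    if (ru) {
        print title
    }
}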

----8<----

You'd still have to filter the output to only keep titles in the main namespace.
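
For that last step, something along these lines might do. It is only a rough sketch: filter_main_ns is a made-up name, and the prefix list is my assumption about the namespace names on the Russian Wiktionary. It is not exhaustive (multi-word prefixes such as the talk namespaces are missing), so check it against the <namespaces> section at the top of the dump.

```shell
# Hypothetical helper: reads titles on stdin and drops those that start
# with a known namespace prefix. The prefix list below is an assumption
# and is not exhaustive.
filter_main_ns() {
    awk -F: '
        BEGIN {
            n = split("Шаблон Категория Викисловарь Обсуждение Участник Файл MediaWiki", names, " ")
            for (i = 1; i <= n; i++) {
                ns[names[i]] = 1
            }
        }

        # Keep the line unless the text before the first colon is a known prefix.
        NF == 1 || !($1 in ns)
    '
}
```

Assuming the script above is saved as ru-titles.awk (again, a made-up name), the whole pipeline would look like: bzcat ruwiktionary-20120107-pages-articles.xml.bz2 | awk -f ru-titles.awk | filter_main_ns. Titles that contain a colon but no known prefix are kept, which seems the safer default for dictionary entries.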

It should be possible using categories too, but this would be neither
easier nor more reliable, and it would be much slower.

Best regards,

¹ http://dumps.wikimedia.org/ruwiktionary/20120107/ruwiktionary-20120107-pages-articles.xml.bz2

--
Jérémie