[Xmldatadumps-l] List of all words of a wiktionary

Sébastien Druon druon.sebastien at gmail.com
Mon Jan 9 10:21:01 UTC 2012


Thank you all for the tips.

Is there also an easy way to get the parsed (HTML) version of these
entries?

Thanks again

Sebastien

On 9 January 2012 10:51, Jérémie Roquet <arkanosis at gmail.com> wrote:

> Hi Sebastien,
>
> 2012/1/9 Sébastien Druon <druon.sebastien at gmail.com>:
> > How is it possible to get the list of all the entries (words) of a
> > wiktionary?
> > For example, for the Russian Wiktionary, I want to get the list of all the
> > Russian entries (no other languages).
>
> You can download the latest dump¹ and pass it through a simple awk
> script that prints the titles of the pages that contain the {{-ru-}}
> template, i.e. something like:
>
> ----8<----
>
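> # Print the title of every page whose text contains {{-ru-}}.
> BEGIN {
>  ru = 0
> }
>
> # Flush the last page of the dump.
> END {
>  if (ru) {
>    print title
>  }
> }
>
> # A <title> line starts a new page: flush the previous page first.
> /<title>.*<\/title>/ {
>  if (ru) {
>    print title
>  }
>
>  # Assumes a 4-space indent: skips "    <title>" (11 chars), drops "</title>" (8).
>  title = substr($0, 12, length($0) - 19)
>  ru = 0
> }
>
> # index() matches literally, so the braces need no regex escaping.
> index(tolower($0), "{{-ru-}}") {
>  ru = 1
> }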
>
> ----8<----
>
> You'd still have to filter the output to only keep titles in the main
> namespace.
>
> It should be possible using categories too, but this wouldn't be any
> easier or more reliable, and it would be much slower.
>
> Best regards,
>
> ¹
> http://dumps.wikimedia.org/ruwiktionary/20120107/ruwiktionary-20120107-pages-articles.xml.bz2
>
> --
> Jérémie
>
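
The namespace filtering mentioned in the quoted message (keeping only
main-namespace titles) could be sketched roughly as follows. This is a
minimal Python sketch, not part of the original awk script: the function
name main_namespace_titles is hypothetical, and the namespace names in
the example are illustrative only. The real list would have to be taken
from the <siteinfo> section of the dump. It assumes that a title belongs
to another namespace exactly when it starts with one of those names
followed by a colon.

```python
def main_namespace_titles(titles, namespace_names):
    """Yield the titles that do not start with a known namespace prefix.

    `namespace_names` is supplied by the caller (assumption: it lists every
    non-main namespace of the wiki, without the trailing colon).
    """
    # Compare case-insensitively; this is an approximation of how
    # MediaWiki normalises namespace prefixes.
    prefixes = tuple(name.lower() + ":" for name in namespace_names)
    for title in titles:
        title = title.strip()
        # Skip empty lines and anything carrying a known namespace prefix.
        if title and not title.lower().startswith(prefixes):
            yield title


if __name__ == "__main__":
    # Illustrative prefixes only; use the list from the dump's <siteinfo>.
    example_namespaces = ["Категория", "Шаблон", "Викисловарь"]
    sample = ["слово", "Шаблон:-ru-", "Категория:Русский язык", "дом"]
    for title in main_namespace_titles(sample, example_namespaces):
        print(title)
```

Titles that merely contain a colon (without a known namespace name in
front of it) are kept, which is why the namespace list is passed in
explicitly instead of dropping every title with a colon.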