List of all words of a wiktionary - Xmldatadumps-l - lists.wikimedia.org


List of all words of a wiktionary


Sébastien Druon

9 Jan 2012

1:17 a.m.

Hello,

How can I get the list of all the entries (words) of a Wiktionary? For example, for the Russian Wiktionary, I want the list of all the Russian entries (no other languages).

Thanks a lot in advance,
Sebastien


Jérémie Roquet

9 Jan 2012

11:51 a.m.

Hi Sebastien,

2012/1/9 Sébastien Druon <druon.sebastien(a)gmail.com>:

How is it possible to get the list of all the entries (words) of a Wiktionary? For example, for the Russian Wiktionary, I want to get the list of all the Russian entries (no other languages)

You can download the latest dump¹ and pass it through a simple awk script that prints the titles of the pages that contain the {{-ru-}} template, i.e. something like:

----8<----
# ru is set once the current page is known to contain {{-ru-}}
BEGIN { ru = 0 }
# Flush the last page of the dump
END { if (ru) { print title } }
# A <title> line starts a new page: flush the previous one first
/<title>.*</ {
    if (ru) { print title }
    # Dump lines look like "    <title>Foo</title>" (4 spaces of indentation)
    title = substr($0, 12, length($0) - 19)
    ru = 0
}
# Plain substring test, so the braces are never read as regex syntax
index(tolower($0), "{{-ru-}}") { ru = 1 }
----8<----

You'd still have to filter the output to only keep titles in the main namespace.

It should be possible using categories too, but this wouldn't be any easier or more reliable, and it would be much slower.

Best regards,

¹ http://dumps.wikimedia.org/ruwiktionary/20120107/ruwiktionary-20120107-pag…

--
Jérémie
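The same idea, including the main-namespace filter mentioned above, can be sketched in Python rather than awk. This is only an illustration under stated assumptions: it assumes each <page> element in the dump carries an <ns> element (older export formats may not), and the names russian_titles, local and SAMPLE are made up for this example.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix: '{uri}title' -> 'title'."""
    return tag.rsplit("}", 1)[-1]


def russian_titles(stream, marker="{{-ru-}}"):
    """Yield titles of main-namespace pages whose wikitext contains marker.

    Assumption: every <page> has an <ns> child; pages without one are skipped.
    """
    for _, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = ns = None
        text = ""
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "ns":
                ns = child.text
            elif name == "text":
                text = child.text or ""
        # Same case-insensitive test as the awk script, plus the namespace filter
        if ns == "0" and marker in text.lower():
            yield title
        elem.clear()  # drop the page's content to keep memory use low


# Tiny hand-written stand-in for a dump, for demonstration only
SAMPLE = """<mediawiki>
  <page><title>слово</title><ns>0</ns>
    <revision><text>{{-ru-}} ...</text></revision></page>
  <page><title>word</title><ns>0</ns>
    <revision><text>{{-en-}} ...</text></revision></page>
  <page><title>Шаблон:-ru-</title><ns>10</ns>
    <revision><text>{{-ru-}}</text></revision></page>
</mediawiki>"""

titles = list(russian_titles(io.BytesIO(SAMPLE.encode("utf-8"))))
print(titles)
```

On a real dump, the stream could come from the standard library's bz2.open(path, "rb"), so the file never has to be decompressed on disk.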


Sébastien Druon

12:21 p.m.

Thank you all for the tips.

Do you also have a good tip for getting the parsed/HTML version of these entries in an easy way?

Thanks again,
Sebastien

Ariel T. Glenn

1:17 p.m.

Among the files produced for each dump of each project is a list of all titles in namespace 0. This would include all lemmas for a Wiktionary. See, for example, http://dumps.wikimedia.org/ruwiktionary/20120107/ruwiktionary-20120107-all-…

Ariel
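Reading such a title list could look like the Python sketch below. It rests on assumptions that are not stated in the thread: that the file is gzip-compressed, holds one title per line with underscores standing in for spaces, and starts with a "page_title" header line. The function name read_titles is made up for this example, and the sample data is built in memory. Note that namespace 0 holds entries for words of every language, so this alone does not isolate the Russian ones.

```python
import gzip
import io


def read_titles(lines, header="page_title"):
    """Return titles from an iterable of text lines.

    Assumptions: one title per line, underscores instead of spaces,
    and an optional header line equal to `header`.
    """
    titles = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line == header:
            continue
        titles.append(line.replace("_", " "))
    return titles


# Build a tiny gzip file in memory as a stand-in for the real download
buf = io.BytesIO()
with gzip.open(buf, "wt", encoding="utf-8") as fh:
    fh.write("page_title\nслово\nдобрый_день\n")
buf.seek(0)

with gzip.open(buf, "rt", encoding="utf-8") as fh:
    titles = read_titles(fh)
print(titles)
```

For a downloaded file, gzip.open(path, "rt", encoding="utf-8") would take the place of the in-memory buffer.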

Platonides

11:08 p.m.

On 09/01/12 10:51, Jérémie Roquet wrote:

You'd still have to filter the output to only keep titles in the main namespace. It should be possible using categories too, but this wouldn't be any easier nor more reliable and would be much slower.

I disagree. You would only need the much smaller ruwiktionary-20120107-category.sql.gz + ruwiktionary-20120107-page.sql.gz


Jérémie Roquet

10 Jan 2012

11:26 a.m.

Hi Platonides,

2012/1/9 Platonides <platonides(a)gmail.com>:

On 09/01/12 10:51, Jérémie Roquet wrote:

It should be possible using categories too, but this wouldn't be any easier nor more reliable and would be much slower.

I disagree. You would only need the much smaller ruwiktionary-20120107-category.sql.gz + ruwiktionary-20120107-page.sql.gz

Good point. Still, you'd have to understand the categories' hierarchy :-) Best regards, -- Jérémie
