Best way to get html/parsed version of a wiktionary - Xmldatadumps-l - lists.wikimedia.org

Sébastien Druon

9 Jan 2012, 4:49 p.m.

Hi!

What is the best/easiest way to get a parsed version (including template resolution) of all entries of a Wiktionary (for example, separate HTML files for each entry)?

Thanks in advance,
Sebastien

Lars Aronsson

9 Jan 2012, 5:06 p.m.

On 01/09/2012 04:49 PM, Sébastien Druon wrote:

> What is the best/easiest way to get a parsed version (including template resolution) of all entries of a wiktionary (separate html files for each entry for example).

I think that varies with what you are trying to accomplish. In many situations I have found this Perl script very useful:
http://meta.wikimedia.org/wiki/User:LA2/Extraktor

For example, it can easily be modified (if you are a Perl programmer) to extract a list of all Russian language entries from the Russian Wiktionary, which was your previous question.

The script loops over the lines of the (well-formed, nicely indented) XML dump, accumulates the lines belonging to one wiki page, and then runs a set of conditions (the "if" statement) for each page. You can pipe the decompressed dump through such a script, so you never have to store the decompressed data. That makes it very efficient in both time and space.

A good Python programmer can probably do the same in half the amount of code.

--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
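The line-accumulating approach described above could look roughly like this in Python. This is only a sketch, not a port of the Extraktor script: the function names, the sample data, and the `{{-ru-}}` marker used to recognise a Russian section are assumptions made for illustration. It works on raw wikitext only and does not expand templates.

```python
import html
import re
from typing import Callable, Iterable, Iterator, List

TITLE_RE = re.compile(r"<title>(.*?)</title>")


def iter_pages(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the lines of one <page>...</page> element at a time."""
    page = None
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("<page>"):
            page = []                      # start accumulating a new page
        if page is not None:
            page.append(line)
            if stripped.endswith("</page>"):
                yield page                 # page complete, hand it over
                page = None                # and forget it, keeping memory flat


def matching_titles(lines: Iterable[str],
                    keep: Callable[[str, str], bool]) -> Iterator[str]:
    """Yield titles of pages for which keep(title, page_xml) is true."""
    for page in iter_pages(lines):
        text = "".join(page)
        match = TITLE_RE.search(text)
        if match:
            title = html.unescape(match.group(1))  # titles are XML-escaped
            if keep(title, text):
                yield title


def is_russian_entry(title: str, text: str) -> bool:
    # Hypothetical condition: skip namespaced titles such as "Category:..."
    # and assume a Russian section is marked with {{-ru-}}.
    return ":" not in title and "{{-ru-}}" in text


# Tiny made-up sample standing in for the decompressed dump.
SAMPLE = """<mediawiki>
  <page>
    <title>дом</title>
    <revision>
      <text>= {{-ru-}} =
some wikitext</text>
    </revision>
  </page>
  <page>
    <title>house</title>
    <revision>
      <text>= {{-en-}} =</text>
    </revision>
  </page>
</mediawiki>
"""

for title in matching_titles(SAMPLE.splitlines(keepends=True), is_russian_entry):
    print(title)
```

To run it on a real dump, replace `SAMPLE.splitlines(keepends=True)` with `sys.stdin` (after `import sys`) and pipe the decompressed dump into the script, for example with `bzcat` on the `.xml.bz2` file, so the decompressed data never has to be stored.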
