On 01/09/2012 04:49 PM, Sébastien Druon wrote:
> What is the best/easiest way to get a parsed version (including template resolution) of all entries of a Wiktionary (separate HTML files for each entry, for example)?
I think that varies with what you are trying to accomplish. In many situations I have found this Perl script very useful: http://meta.wikimedia.org/wiki/User:LA2/Extraktor
For example, it can easily be modified (if you are a Perl programmer) to extract a list of all Russian-language entries from the Russian Wiktionary, which was your previous question.
The script loops over the lines of the (well-formed, nicely indented) XML dump, accumulates the lines belonging to one wiki page, and then runs a set of conditions (the "if" statement) on each page. You can pipe the decompressed dump through such a script, so you never have to store the decompressed data on disk; that makes it very time- and space-efficient.
A good Python programmer can probably do the same in half the amount of code.
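For instance, the accumulate-and-test loop described above might look roughly like this in Python. This is only a sketch, not the Extraktor script itself: it assumes the usual pages-articles dump layout where <page> and </page> sit on their own lines, and the title extraction at the end is just a placeholder condition you would replace with your own tests.

```python
def pages(stream):
    """Yield one string per <page>...</page> element of a MediaWiki XML dump.

    Relies on the dump being well formed and nicely indented, with the
    <page> and </page> tags each on their own line (as the dumps are).
    """
    buf = None
    for line in stream:
        if "<page>" in line:
            buf = []                     # start accumulating a new page
        if buf is not None:
            buf.append(line)
            if "</page>" in line:
                yield "".join(buf)       # hand the complete page to the caller
                buf = None

# In real use you would pipe the decompressed dump straight through the
# script, never storing the decompressed data, e.g. (file name assumed):
#   bzcat ruwiktionary-latest-pages-articles.xml.bz2 | python extract.py
# with the loop reading:  for page in pages(sys.stdin): ...
#
# Tiny demonstration on a fake two-page dump:
sample = """\
<mediawiki>
  <page>
    <title>стол</title>
  </page>
  <page>
    <title>table</title>
  </page>
</mediawiki>
""".splitlines(keepends=True)

for page in pages(sample):
    # Placeholder per-page condition: pull out the title.
    title = page.split("<title>")[1].split("</title>")[0]
    print(title)  # prints "стол" then "table"
```

Because `pages` is a generator driven line by line, memory use stays proportional to a single page, no matter how large the dump is.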