On 01/09/2012 04:49 PM, Sébastien Druon wrote:
> What is the best/easiest way to get a parsed version (including
> template resolution) of all entries of a wiktionary (separate html
> files for each entry, for example)?
I think that depends on what you are trying to accomplish. In many
situations I have found it useful to use this Perl script:
http://meta.wikimedia.org/wiki/User:LA2/Extraktor
For example, it can easily be modified (if you are a Perl programmer)
to extract a list of all Russian language entries from the Russian
Wiktionary, which was your previous question.
The script loops over the lines of the (well formed, nicely indented)
XML dump, accumulates lines belonging to one wiki page, and then
runs a set of conditions (the "if" statement) for each page.
You can pipe the decompressed dump through such a script, so you
never have to store the decompressed data on disk. That makes it
very time- and space-efficient.
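Concretely, the pipeline might look like this (the dump filename and
the script name are assumptions here; substitute the files you
actually have):

```shell
# Decompress the dump on the fly and stream it into the extractor;
# the uncompressed XML never touches the disk.
bzcat ruwiktionary-latest-pages-articles.xml.bz2 | perl extraktor.pl
```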
A good Python programmer can probably do the same in half the amount
of code.
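For instance, the accumulate-and-test loop described above could be
sketched in Python roughly like this (a minimal sketch, not the
Extraktor script itself; it assumes the dump's <page> and </page>
tags each sit on their own line, as in the official dumps, and the
title-printing condition at the bottom is just an example):

```python
import re
import sys

def iter_pages(lines):
    """Yield the text of one <page>...</page> block at a time.

    Relies on the dump being well formed and nicely indented, with
    the <page> and </page> tags each on their own line.
    """
    page = None  # None means we are between pages
    for line in lines:
        stripped = line.strip()
        if stripped == "<page>":
            page = []          # start accumulating a new page
        elif stripped == "</page>":
            if page is not None:
                yield "".join(page)
            page = None
        elif page is not None:
            page.append(line)

if __name__ == "__main__":
    # Run your set of conditions against each page; as an example,
    # print every page title found in the stream.
    for page in iter_pages(sys.stdin):
        m = re.search(r"<title>(.*?)</title>", page)
        if m:
            print(m.group(1))
```

Used the same way as the Perl script, i.e. with the decompressed
dump piped into it on standard input.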
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik -
http://aronsson.se