On 01/09/2012 04:49 PM, Sébastien Druon wrote:
> What is the best/easiest way to get a parsed version (including template resolution) of all entries of a Wiktionary (separate HTML files for each entry, for example)?
I think that varies with what you are trying to accomplish. In many situations I have found this Perl script very useful: http://meta.wikimedia.org/wiki/User:LA2/Extraktor
For example, it can easily be modified (if you are a Perl programmer) to extract a list of all Russian-language entries from the Russian Wiktionary, which was your previous question.
The script loops over the lines of the (well-formed, nicely indented) XML dump, accumulates the lines belonging to one wiki page, and then runs a set of conditions (the "if" statement) on each page. You can pipe the decompressed dump through such a script, so you never have to store the decompressed data on disk; that makes it very time- and space-efficient.
A good Python programmer can probably do the same in half the amount of code.
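For instance, the accumulate-and-test loop described above might look roughly like this in Python. This is only a sketch, not the Extraktor script itself: it assumes the usual pages-articles dump layout where <page> and </page> sit on their own lines, and the title extraction at the end is just a placeholder condition you would replace with your own tests.

```python
def pages(stream):
    """Yield one string per <page>...</page> element of a MediaWiki XML dump.

    Relies on the dump being well formed and nicely indented, with the
    <page> and </page> tags each on their own line (as the dumps are).
    """
    buf = None
    for line in stream:
        if "<page>" in line:
            buf = []                     # start accumulating a new page
        if buf is not None:
            buf.append(line)
            if "</page>" in line:
                yield "".join(buf)       # hand the complete page to the caller
                buf = None

# In real use you would pipe the decompressed dump straight through the
# script, never storing the decompressed data, e.g. (file name assumed):
#   bzcat ruwiktionary-latest-pages-articles.xml.bz2 | python extract.py
# with the loop reading:  for page in pages(sys.stdin): ...
#
# Tiny demonstration on a fake two-page dump:
sample = """\
<mediawiki>
  <page>
    <title>стол</title>
  </page>
  <page>
    <title>table</title>
  </page>
</mediawiki>
""".splitlines(keepends=True)

for page in pages(sample):
    # Placeholder per-page condition: pull out the title.
    title = page.split("<title>")[1].split("</title>")[0]
    print(title)  # prints "стол" then "table"
```

Because `pages` is a generator driven line by line, memory use stays proportional to a single page, no matter how large the dump is.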