Gerard Meijssen wrote:
The HTML can be read, and the codes have been changed to the pre-UTF format (e.g. ыш etc.). It can therefore be parsed, and eventually I could upload it to Wiktionary. The question is: how do I convert it to UTF-8?
You can use the software I wrote to convert some wikis to UTF-8: http://mboquien.free.fr/wikiconvert-20040902.tar.gz Don't read the "README" file, as it is completely outdated; I have to rewrite it. The usage is simple: ./wikiconv -i input_file -o output_file -e encoding_of_the_input_file. It needs Qt and works fine under Linux. I have not tried it on other operating systems.
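If you only need to re-encode a single file and do not want to build the program, the same kind of conversion can be sketched in a few lines of Python. This is not how wikiconv works internally, only a minimal illustration; the function name convert_to_utf8 and its arguments are made up for this example, and it assumes you already know the encoding of the input file.

```python
from pathlib import Path


def convert_to_utf8(input_file, output_file, encoding):
    """Re-encode input_file (stored in `encoding`) as UTF-8 into output_file.

    Hypothetical helper for illustration; not part of wikiconv.
    """
    # Decode strictly so that a wrong `encoding` guess fails loudly
    # instead of silently writing corrupted text.
    text = Path(input_file).read_bytes().decode(encoding)
    Path(output_file).write_bytes(text.encode("utf-8"))
```

For example, convert_to_utf8("input_file", "output_file", "cp1251") would re-encode a Windows-1251 file. If the encoding given is wrong, decoding may raise UnicodeDecodeError rather than produce garbage.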
A question about the UTF-8 conversion: is it possible to have a bot convert the non-UTF-8 content to UTF-8 on en:wiktionary?
It is also possible with the program, but it would put the en wiktionary in read-only mode for a few minutes; shaihulud is used to this kind of operation. I use it quite often on fr:, but manually, to convert individual pages that contain many UTF-8 entities.
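To give an idea of what converting such a page involves, here is a rough Python sketch. It assumes that "entities" means numeric character references such as &#1099; or &#x44B;, and it is not the code of the program above; the name decode_numeric_entities is invented for this example. ASCII code points are deliberately left escaped so that references like &#60; do not turn into live markup.

```python
import re

# Matches decimal (&#1099;) and hexadecimal (&#x44B;) character references.
_NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_entities(text):
    """Replace non-ASCII numeric character references with real characters.

    Hypothetical helper for illustration only.
    """
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave ASCII, surrogates and out-of-range values untouched.
        if code_point < 128 or code_point > 0x10FFFF:
            return match.group(0)
        if 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return _NUMERIC_ENTITY.sub(replace, text)
```

The returned string can then be saved as UTF-8. Named entities such as &amp; are not touched by this sketch.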
Med