Hi, After analysing how to parse the text version of a GEMET list, I decided to also have a look at the HTML code. The reason was that the Russian, Bulgarian and Greek characters became unreadable in the text version. The HTML can be read as well, as the codes are changed to be in the pre-UTF format (e.g. ыш etc.). It can therefore be parsed, and eventually I could upload it to Wiktionary. The question is: how do I convert it to UTF-8?
A question about the UTF-8 conversion: is it possible to have a bot convert the non-UTF-8 stuff to UTF-8 on en:wiktionary?
Thanks, GerardM
Gerard Meijssen wrote:
The HTML can be read as well, as the codes are changed to be in the pre-UTF format (e.g. ыш etc.). It can therefore be parsed, and eventually I could upload it to Wiktionary. The question is: how do I convert it to UTF-8?
You can use the software I wrote to convert some wikis to UTF-8: http://mboquien.free.fr/wikiconvert-20040902.tar.gz Don't read the "README" file, as it is completely outdated; I have to rewrite it. The usage is simple:
./wikiconv -i input_file -o output_file -e encoding_of_the_input_file
It needs Qt and works fine under Linux. I have not tried it on other OSes.
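For readers without Qt, the same kind of re-encoding can be sketched in a few lines of Python. This is not wikiconv and makes no claim about how wikiconv works internally; the function name convert_to_utf8 and its three parameters are made up for illustration and simply mirror the -i, -o and -e options above. It assumes the input encoding is known and is one that Python's codecs support (for example "koi8-r" or "iso-8859-1").

```python
def convert_to_utf8(input_path, output_path, input_encoding):
    """Re-encode a text file from input_encoding to UTF-8.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Decode the bytes using the original (legacy) encoding.
    # newline="" keeps the line endings exactly as they are in the input.
    with open(input_path, "r", encoding=input_encoding, newline="") as src:
        text = src.read()

    # Write the same characters back out, this time encoded as UTF-8.
    with open(output_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


if __name__ == "__main__":
    import sys

    # Usage: python convert.py input_file output_file encoding
    if len(sys.argv) == 4:
        convert_to_utf8(sys.argv[1], sys.argv[2], sys.argv[3])
```

If the input encoding is given wrongly, the read will either raise a UnicodeDecodeError or silently produce wrong characters, so it is worth checking a few lines of the output by eye.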
A question about the UTF-8 conversion: is it possible to have a bot convert the non-UTF-8 stuff to UTF-8 on en:wiktionary?
It is also possible with the program, but it would put the en Wiktionary in read-only mode for a few minutes; shaihulud is used to this kind of operation. I use it quite often on fr:, but manually, to convert pages that contain many UTF-8 entities.
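As an illustration of what converting such entities involves, here is a minimal Python sketch. It is an assumption about the general technique, not the code of the program mentioned above; the name decode_numeric_refs is hypothetical. It replaces numeric character references (decimal like &#1099; or hexadecimal like &#x44B;) with the characters they stand for, and deliberately leaves named entities such as &amp; and ASCII references such as &#60; untouched, since those may be intentional escapes of markup characters in wiki text.

```python
import re

# Matches decimal (&#1099;) and hexadecimal (&#x44B;) character references.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_refs(text):
    """Replace non-ASCII numeric character references with real characters."""

    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Keep ASCII references (possible markup escapes), out-of-range
        # values and surrogates exactly as they were written.
        if code < 128 or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        return chr(code)

    return NUMERIC_REF.sub(replace, text)


sample = "&#1099;&#1096; &amp; &#60;"
print(decode_numeric_refs(sample))  # prints: ыш &amp; &#60;
```

The resulting string can then be saved as UTF-8, for example with text.encode("utf-8").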
Med
A question about the UTF-8 conversion: is it possible to have a bot convert the non-UTF-8 stuff to UTF-8 on en:wiktionary?
I believe there is a bot that does exactly that. You may want to contact Head at the German WP (http://de.wikipedia.org/wiki/Benutzer_Diskussion:Head); he manages that bot (and he also wrote it, I believe).
Daniel
wikitech-l@lists.wikimedia.org