[Foundation-l] Tragical dynamics: that run for the number of articles
Tomasz Ganicz
polimerek at gmail.com
Sat Jun 28 13:27:46 UTC 2008
2008/6/28 Ziko van Dijk <zvandijk at googlemail.com>:
> There is Google Translater, and the Interwikis help as well. That
> article of he.WP about Lodz I would count as a real article, because
> there is information more than in a data base (links to Holocaust
> related articles, something about 19th century, economy (textile)).
> Indeed, I would like to make a more scientific scheme and apply it to
> a larger sample, maybe there will establish a research group about. I
> believe that my method does give a reasonable picture; of course,
> whether my results say "50.000" real articles or "52.000" is not
> really a measurable difference.
Sorry, but this only shows that your results are not reliable,
because they are based on your feelings and on poor-quality machine
translations, which could change your feelings in unpredictable ways. I
am afraid that the results shown in your table are just a
reflection of:
a) the quality of machine translation performed by Google - it is
better for Latin- and Germanic-based languages (English, French,
Italian, German, Dutch etc.) and much worse for Slavic, Arabic and
East Asian languages.
b) your own subconscious attitude toward various nations and Wikipedias
- even if you are trying to evaluate them all fairly
Google Translate sometimes produces really funny results when
translating from Polish to English. For example:
"Przyszłość partii przyszłością narodu" (Future of the party is the
future of the nation) is translated to:
"The future of the future of the nation lot" :-)
Or:
"Byłbym spał, gdybym mógł." (I would sleep, if only I could)
is translated to:
"I would be he lay, if I only could."
http://en.wikipedia.org/wiki/Machine_translation_software_usability#Trustworthiness_and_Security
http://www.nist.gov/speech/tests/mt/2006/doc/mt06eval_official_results.html
I think that a method to distinguish between "real" and "unreal"
articles should be based on analysis of the article's history and on
formal "hard" criteria.
For example, one could set a criterion that if there are at least 4
sentences written by a human, it is a "real article".
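A rough sketch of how such a hard criterion might be checked. This is
only an illustration under stated assumptions: the input format (a list
of revisions, each flagged as bot or human), the function names and the
naive sentence splitter are all hypothetical and not taken from any
existing tool. Only the threshold of 4 comes from the example above.

```python
import re

# Threshold from the example criterion above: 4 human-written sentences.
MIN_HUMAN_SENTENCES = 4


def count_sentences(text):
    """Rough sentence count: split on '.', '!' or '?' followed by
    whitespace or the end of the text. Abbreviations such as 'e.g. x'
    will be overcounted; this is a deliberate simplification."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return sum(1 for p in parts if p.strip())


def is_real_article(revisions, min_sentences=MIN_HUMAN_SENTENCES):
    """revisions: list of (author_is_bot, added_text) pairs.
    This input format is an assumption made for the sketch; real
    article histories would need to be converted into it first."""
    human_text = " ".join(text for is_bot, text in revisions if not is_bot)
    return count_sentences(human_text) >= min_sentences
```

With this sketch, an article whose text was mostly added by a bot and
which has only two sentences from human editors would not count as
"real", regardless of how its machine translation reads.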
--
Tomek "Polimerek" Ganicz
http://pl.wikimedia.org/wiki/User:Polimerek
http://www.ganicz.pl/poli/
http://www.ptchem.lodz.pl/en/TomaszGanicz.html