On Thu, Apr 23, 2015 at 1:41 PM, Yusuke Matsubara whym@whym.org wrote:
Hi Amir,
You might want to drop Japanese at least for now, because it's complicated. Between any language and Japanese, there is no single standard to follow as far as I know. In fact, choosing the best way to write the name (and thus the page name) is one of the most frequent types of disputes about foreign BLPs on Japanese Wikipedia. There are always multiple possible transliterations for each syllable (it's usually phonetic, not letter-to-letter, transliteration), and a random choice will probably not be the best one. For instance, "Amir" can be "アミール", "アーミール" or "アーミル".
That said, one possible approach might be to generate, for each name, all possible transliterations using a set of rules, and google them to see which is the most common one (and has at least some uses on the web). It would probably be safer to add the name in the source language into the query, and reject any candidate that gets less than, say, 100 hits, because some small number of hits might be just noise.
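The selection step described above could be sketched as follows. This is only an illustration, not an existing tool: `pick_transliteration` and the `hit_count` callable are hypothetical names, the hit counts are made up, and a real version would need some search backend to supply the counts.

```python
from typing import Callable, Iterable, Optional


def pick_transliteration(
    source_name: str,
    candidates: Iterable[str],
    hit_count: Callable[[str], int],  # hypothetical: returns the number of hits for a query
    min_hits: int = 100,
) -> Optional[str]:
    """Return the candidate with the most hits, or None if none reaches min_hits."""
    best, best_hits = None, 0
    for candidate in candidates:
        # Put the source-language name into the query as well, to cut down on noise.
        query = f'"{candidate}" "{source_name}"'
        hits = hit_count(query)
        # Reject candidates below the threshold; a handful of hits may be noise.
        if hits >= min_hits and hits > best_hits:
            best, best_hits = candidate, hits
    return best


# Made-up hit counts, for illustration only.
fake_counts = {
    '"アミール" "Amir"': 5400,
    '"アーミール" "Amir"': 40,
    '"アーミル" "Amir"': 900,
}
choice = pick_transliteration(
    "Amir",
    ["アミール", "アーミール", "アーミル"],
    lambda query: fake_counts.get(query, 0),
)
```

Passing the counting function in as an argument keeps the sketch independent of any particular search service.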
This is not how the bot works. It uses an algorithm that clusters words rather than letters, since working letter by letter is useless (lots of languages, like Persian, have the same situation as Japanese). So we don't need to deal with the complicated rules of transliteration directly: the bot automatically follows them and chooses the most frequent result.
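To make the word-level idea concrete, here is a rough sketch. It is not the bot's actual code, which is not shown in this thread: it assumes that existing label pairs align word for word, and `build_word_table`, `transliterate` and the sample pairs are hypothetical illustrations.

```python
from collections import Counter, defaultdict


def build_word_table(label_pairs):
    """For each source word, count the target words seen at the same position."""
    table = defaultdict(Counter)
    for source_label, target_label in label_pairs:
        source_words = source_label.split()
        target_words = target_label.split()
        if len(source_words) != len(target_words):
            continue  # skip pairs that do not align word for word
        for source_word, target_word in zip(source_words, target_words):
            table[source_word][target_word] += 1
    return table


def transliterate(name, table):
    """Map each word to its most frequent target; None if any word is unknown."""
    result = []
    for word in name.split():
        if word not in table:
            return None
        result.append(table[word].most_common(1)[0][0])
    return " ".join(result)


# Illustrative English/Persian label pairs (sample data, not taken from the bot).
pairs = [
    ("Alan Turing", "آلن تورینگ"),
    ("Alan Rickman", "آلن ریکمن"),
    ("Alan Moore", "الن مور"),
    ("Amir Khan", "امیر خان"),
]
table = build_word_table(pairs)
label = transliterate("Alan Khan", table)
```

Because the choice is made per whole word, no letter-level transliteration rules are written down anywhere; the most common existing spelling simply wins.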
I will make 100 edits; please check them and tell me if anything is wrong. I have started the bot's analysis run.
Best
Best, Yusuke
On Wed, Apr 22, 2015 at 9:48 PM, Amir Ladsgroup ladsgroup@gmail.com wrote:
Hello,
I started a bot that auto-transliterates names of humans, initially with Persian and English (as a pair) since I know both and can debug. After some modifications, in the last check, in more than several hundred edits I checked, I couldn't find any errors. I want to expand this bot to other languages, but first I need opinions from people who know the rules for transliterating names in these languages. I tried to work out the rules and this is my result, but I need someone familiar with them to confirm:
* Chinese: Instead of a space it uses the "·" character (it's not a dot), but the order is the same, e.g. Alan Turing is "艾伦·图灵", in which "艾伦" means Alan and "图灵" means Turing.
* Japanese: It's the same but with a different separator, "・", e.g. "アラン・チューリング".
* Russian: The separator is the space character but the order is "FamilyName, GivenName", e.g. "Тьюринг, Алан" is "Turing, Alan". Handling names with more than two words would be pretty complicated (I skip them).
* I checked Hebrew and Greek and both are simple languages like Persian: same order, space as separator.
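As a small sketch, the rules listed above could be written down like this. They are still the unconfirmed rules from this message, only two-word names are handled, and the `join_name` helper and the language codes used as keys are illustrative assumptions.

```python
# Separator per language, as listed above; anything else falls back to a space.
SEPARATORS = {"zh": "·", "ja": "・"}


def join_name(given, family, lang):
    """Join an already transliterated given and family name for one language."""
    if lang == "ru":
        # Russian, as described above: "FamilyName, GivenName".
        return f"{family}, {given}"
    return SEPARATORS.get(lang, " ").join([given, family])


chinese = join_name("艾伦", "图灵", "zh")
japanese = join_name("アラン", "チューリング", "ja")
russian = join_name("Алан", "Тьюринг", "ru")
```

Names with more than two words are deliberately left out, matching the note above that they are skipped.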
If you can help me, it would make a great difference in the number of labels in your language. Things you can help with are:
1- Confirm or correct the rules for these languages and add other rules if needed.
2- Suggest more languages. I thought about Sanskrit, Hindi, and Telugu, but I don't know anyone who can check the rules; if you do, please help me.
3- For any language I will do an initial run just to test. If you can check the edits of the bot (which is pretty easy, e.g. see this), it would be awesome.
Thanks, Best
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l