Hi Amir,
You might want to drop Japanese, at least for now, because it's complicated. Between any language and Japanese, there is no single standard to follow as far as I know. In fact, choosing the best way to write the name (and thus the page name) is one of the most frequent types of disputes about foreign BLPs on Japanese Wikipedia. There are always multiple possible transliterations for each syllable (it's usually phonetic, not letter-to-letter, transliteration), and a random choice will probably not be the best one. For instance, "Amir" can be "アミール", "アーミール" or "アーミル".
That said, one possible approach might be to generate, for each name, all possible transliterations using a set of rules, and google them to see which is the most common one (and has at least some uses on the web). It would probably be safer to add the name in the source language to the query, and to reject any candidate that gets fewer than, say, 100 hits, because a small number of hits might be just noise.
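To make the idea concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the rule table is a toy covering only "Amir", the names `candidates` and `pick_best` are made up, and `hit_count` is a placeholder for whatever search backend you end up using (it is passed in as a function, so no particular search API is implied).

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Sequence

# Toy rule table (an assumption, not a real standard): each source
# syllable maps to the katakana spellings it could plausibly take.
RULES: Dict[str, List[str]] = {
    "a": ["ア", "アー"],
    "mi": ["ミ", "ミー"],
    "r": ["ル"],
}


def candidates(syllables: Sequence[str],
               rules: Dict[str, List[str]] = RULES) -> List[str]:
    """Return every combination of per-syllable spellings."""
    options = [rules[s] for s in syllables]
    return ["".join(parts) for parts in product(*options)]


def pick_best(source_name: str,
              syllables: Sequence[str],
              hit_count: Callable[[str, str], int],
              min_hits: int = 100) -> Optional[str]:
    """Pick the candidate with the most hits, or None if all look like noise.

    hit_count(candidate, source_name) is a hypothetical callable that
    returns how many results a query containing both strings gets.
    """
    scored = [(hit_count(c, source_name), c) for c in candidates(syllables)]
    hits, best = max(scored, key=lambda pair: pair[0])
    # Reject the winner too if it falls below the noise threshold.
    return best if hits >= min_hits else None
```

With a stub such as `lambda cand, src: fake_counts.get(cand, 0)` you can try the selection logic offline before wiring in a real search. The threshold of 100 is just the number from the paragraph above, not a tuned value.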
Best, Yusuke
On Wed, Apr 22, 2015 at 9:48 PM, Amir Ladsgroup ladsgroup@gmail.com wrote:
Hello,
I started a bot that auto-transliterates names of humans, initially with Persian and English (as a pair), since I know both and can debug. After some modifications, in the last check I couldn't find any errors in the several hundred edits I reviewed. I want to expand this bot to other languages, but first I need opinions from people who know the rules for transliterating names in those languages. I tried to work out the rules myself and this is my result, but I need someone familiar with each language to confirm:
* Chinese: Instead of a space it uses the "·" character (it's not a dot), but the order is the same. E.g. Alan Turing is "艾伦·图灵", where "艾伦" means Alan and "图灵" means Turing.
* Japanese: It's the same but with a different separator: "・", e.g. "アラン・チューリング".
* Russian: The separator is a space, but the order is "FamilyName, GivenName", e.g. "Тьюринг, Алан" is "Turing, Alan". Handling names with more than two words would be pretty complicated (I skip them).
* Hebrew and Greek: I checked both, and they are simple like Persian: same order, space as separator.
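The separator and ordering rules listed above can be sketched roughly as follows. This is only an illustration of the rules as stated, pending confirmation from speakers of each language; the function name `format_label` and the language-code table are hypothetical and not taken from the bot's actual code.

```python
# Separator between given name and family name, per language code.
# These values mirror the unconfirmed rules listed above.
SEPARATORS = {
    "zh": "·",   # middle dot, not a full stop
    "ja": "・",  # katakana middle dot
    "fa": " ",
    "he": " ",
    "el": " ",
}


def format_label(given: str, family: str, lang: str) -> str:
    """Join two already-transliterated name parts for one language.

    Only two-part names are handled; longer names are skipped upstream.
    """
    if lang == "ru":
        # Russian rule as stated above: "FamilyName, GivenName".
        return f"{family}, {given}"
    return f"{given}{SEPARATORS[lang]}{family}"
```

For example, `format_label("艾伦", "图灵", "zh")` joins the two parts with the Chinese separator, and `format_label("Алан", "Тьюринг", "ru")` swaps the order.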
If you can help me, it would make a great difference in the number of labels in your language. Things you can help with:
1- Confirm or correct the rules for these languages and add other rules if needed.
2- Suggest more languages. I thought about Sanskrit, Hindi, and Telugu, but I don't know anyone who can check the rules. If you do, please help me.
3- For any language I will do an initial run just to test. If you can check the bot's edits (which is pretty easy, e.g. see this), it would be awesome.
Thanks, Best
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l