Hi Amir,
You might want to drop Japanese at least for now, because it's
complicated. Between any language and Japanese, there is no single
standard to follow as far as I know. In fact, choosing the best way to
write the name (and thus the page name) is one of the most frequent
types of disputes about foreign BLPs on Japanese Wikipedia. There are
always multiple possible transliterations for each syllable (it's
usually phonetic, not letter-to-letter, transliteration), and a random
choice will probably not be the best one. For instance, "Amir" can be
"アミール", "アーミール" or "アーミル".
That said, one possible approach might be to generate, for each name,
all possible transliterations using a set of rules, and google them to
see which is the most common one (and has at least some uses on the
web). It would probably be safer to add the name in the source
language to the query, and to reject any candidate that gets fewer than,
say, 100 hits, because a small number of hits might be just noise.
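The generate-and-check idea above could look roughly like this in Python.
This is only a sketch under assumptions: the per-syllable table
SYLLABLE_OPTIONS, the function names, and the hit_count callable (a
stand-in for whatever search backend is used) are hypothetical and not
part of any existing tool.

```python
from itertools import product
from typing import Callable, Dict, List, Optional

# Hypothetical per-syllable options, chosen only to cover the "Amir" example.
# A real rule set would be much larger and language-pair specific.
SYLLABLE_OPTIONS: Dict[str, List[str]] = {
    "a": ["ア", "アー"],
    "mi": ["ミ", "ミー"],
    "r": ["ル"],
}


def candidates(syllables: List[str]) -> List[str]:
    """Return every combination of the per-syllable transliterations."""
    options = [SYLLABLE_OPTIONS[s] for s in syllables]
    return ["".join(combo) for combo in product(*options)]


def pick_best(
    source_name: str,
    syllables: List[str],
    hit_count: Callable[[str], int],  # assumed: maps a query string to a hit count
    min_hits: int = 100,
) -> Optional[str]:
    """Return the candidate with the most hits, or None if none reaches min_hits."""
    best, best_hits = None, 0
    for cand in candidates(syllables):
        # Include the source-language name in the query to cut down on noise.
        hits = hit_count(f'"{cand}" "{source_name}"')
        if hits >= min_hits and hits > best_hits:
            best, best_hits = cand, hits
    return best
```

Passing hit_count in as a parameter keeps the selection logic independent
of the search service, and returning None makes it easy for a bot to skip
names for which no candidate is well attested.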
Best,
Yusuke
On Wed, Apr 22, 2015 at 9:48 PM, Amir Ladsgroup <ladsgroup(a)gmail.com> wrote:
Hello,
I started a bot for auto-transliterating names of humans, initially with
Persian and English (as a pair), since I know both and can debug them. After
some modifications, I couldn't find any errors in the last check, which
covered more than several hundred edits. I want to expand this bot to
other languages, but first I need opinions from people who know the rules for
transliterating names in these languages. I tried to work out the rules, and
this is my result, but I need someone familiar with each language to confirm:
*Chinese: Instead of a space it uses the "·" character (it's not a dot), but
the order is the same, e.g. Alan Turing is "艾伦·图灵", where "艾伦" means
Alan and "图灵" means Turing.
*Japanese: it's the same but different separator: "・", e.g.
"アラン・チューリング"
*Russian: The separator is the space character, but the order is
"FamilyName, GivenName", e.g. "Тьюринг, Алан" is "Turing, Alan". Handling
names with more than two words would be pretty complicated (I skip them).
*I checked Hebrew and Greek and both are simple languages like Persian, same
order, space as separator.
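
The rules listed above can be summarized in a short Python sketch. It only
encodes the rules as proposed in this message, which still need
confirmation; the language codes used as keys, the table names, and
build_label are assumptions made for illustration.

```python
# Separator per language, as proposed in the list above (not yet confirmed).
SEPARATORS = {
    "zh": "·",  # the separator character shown in the Chinese example
    "ja": "・",  # the separator character shown in the Japanese example
    "ru": " ",
    "he": " ",
    "el": " ",
    "fa": " ",
}

# Languages written as "FamilyName, GivenName" according to the list above.
FAMILY_FIRST_WITH_COMMA = {"ru"}


def build_label(given: str, family: str, lang: str) -> str:
    """Join already-transliterated given and family names for one language.

    Only two-part names are handled; longer names are skipped by the bot.
    """
    if lang in FAMILY_FIRST_WITH_COMMA:
        return f"{family}, {given}"
    return f"{given}{SEPARATORS[lang]}{family}"


print(build_label("艾伦", "图灵", "zh"))
print(build_label("アラン", "チューリング", "ja"))
print(build_label("Алан", "Тьюринг", "ru"))
```

Keeping the separators and ordering in small tables means a correction
from a native speaker only changes data, not the joining logic.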
If you can help me, it would make a great difference in the number of labels
in your language. Things you can help with:
1- Confirm or correct the rules for these languages and add other rules if
needed.
2- Suggest more languages. I thought about Sanskrit, Hindi, and Telugu, but I
don't know anyone who can check the rules. If you do, please help me.
3- For any language I will do an initial run just to test. If you can check
the bot's edits (which is pretty easy, e.g. see this), it would be awesome.
Thanks,
Best