On Thu, Apr 23, 2015 at 1:41 PM, Yusuke Matsubara <whym@whym.org> wrote:
Hi Amir,

You might want to drop Japanese at least for now, because it's
complicated. Between any language and Japanese, there is no single
standard to follow as far as I know. In fact, choosing the best way to
write the name (and thus the page name) is one of the most frequent
types of disputes about foreign BLPs on Japanese Wikipedia. There are
always multiple possible transliterations for each syllable (it's
usually phonetic, not letter-to-letter, transliteration), and a random
choice will probably not be the best one. For instance, "Amir" can be
"アミール", "アーミール" or "アーミル".

That said, one possible approach might be to generate, for each name,
all possible transliterations using a set of rules, and google them to
see which is the most common one (and has at least some uses on the
web). It would probably be safer to add the name in the source
language into the query, and reject any candidate that gets less than,
say, 100 hits, because some small number of hits might be just noise.
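The generate-and-filter idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the RULES table is a made-up toy, the chunking of "Amir" into "a" + "mir" is arbitrary, and hit_count is a hypothetical callable standing in for whatever search backend would be used (no real search API is assumed here).

```python
from itertools import product
from typing import Callable, Dict, List, Optional

# Hypothetical toy rules: each source chunk maps to several kana spellings.
RULES: Dict[str, List[str]] = {
    "a": ["ア", "アー"],
    "mir": ["ミール", "ミル"],
}


def candidates(chunks: List[str], rules: Dict[str, List[str]]) -> List[str]:
    """Return every combination of the per-chunk transliterations."""
    options = [rules[chunk] for chunk in chunks]
    return ["".join(parts) for parts in product(*options)]


def pick_best(
    source_name: str,
    chunks: List[str],
    rules: Dict[str, List[str]],
    hit_count: Callable[[str], int],  # hypothetical: query string -> number of hits
    min_hits: int = 100,
) -> Optional[str]:
    """Pick the candidate with the most hits, or None if all fall below min_hits."""
    scored = []
    for cand in candidates(chunks, rules):
        # Include the source-language name in the query to reduce noise.
        query = f'"{cand}" "{source_name}"'
        scored.append((hit_count(query), cand))
    accepted = [(hits, cand) for hits, cand in scored if hits >= min_hits]
    if not accepted:
        return None
    return max(accepted, key=lambda pair: pair[0])[1]
```

With a real backend, pick_best("Amir", ["a", "mir"], RULES, hit_count) would return whichever spelling is most common on the web, or None when every candidate stays under the threshold.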

This is not how the bot works. It uses an algorithm that clusters whole words, not letters, since working letter by letter is useless (many languages, such as Persian, are in the same situation as Japanese). So we don't need to deal with complicated transliteration rules directly: the bot follows them automatically and chooses the most frequent result.
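To make the word-level idea concrete, here is a minimal sketch, not the bot's actual code. It assumes that existing label pairs are split into words, aligned by position, and counted, and that the most frequent target word wins. The function names and the positional alignment are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple


def build_word_table(
    pairs: Iterable[Tuple[str, str]], target_sep: str = " "
) -> Dict[str, Counter]:
    """Count how often each source word is paired with each target word."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for source_label, target_label in pairs:
        source_words = source_label.split()
        target_words = target_label.split(target_sep)
        if len(source_words) != len(target_words):
            continue  # skip labels whose words cannot be aligned one-to-one
        for src, tgt in zip(source_words, target_words):
            table[src][tgt] += 1
    return table


def transliterate(
    name: str, table: Dict[str, Counter], target_sep: str = " "
) -> Optional[str]:
    """Map each word to its most frequent known target; None if a word is unseen."""
    out = []
    for word in name.split():
        if word not in table:
            return None
        out.append(table[word].most_common(1)[0][0])
    return target_sep.join(out)
```

For example, given a table built from existing English/Japanese label pairs with target_sep="・", transliterate("Alan Turing", table, "・") would combine the most frequent Japanese spelling seen for each word.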

I will make 100 edits; please check them and tell me if anything is wrong. I have started the bot on the analysis.

Best

 
Best,
Yusuke

On Wed, Apr 22, 2015 at 9:48 PM, Amir Ladsgroup <ladsgroup@gmail.com> wrote:
> Hello,
> I started a bot that auto-transliterates names of humans, initially with
> Persian and English (as a pair), since I know both and can debug. After
> some modifications, in the last check of more than several hundred
> edits, I couldn't find any errors. I want to expand this bot to
> other languages, but first I need opinions from people who know the rules of
> transliterating names in these languages. I tried to work out the rules, and this
> is my result, but I need someone familiar with them to confirm:
> *Chinese: Instead of a space it uses the "·" character (it's not a dot), but the order is
> the same, e.g. Alan Turing is "艾伦·图灵", where "艾伦" means Alan and "图灵" means
> Turing.
> *Japanese: the same, but with a different separator: "・", e.g. "アラン・チューリング".
> *Russian: The separator is a space, but the order is "FamilyName,
> GivenName", e.g. "Тьюринг, Алан" is "Turing, Alan". Handling names with more
> than two words would be pretty complicated (I skip them).
> *I checked Hebrew and Greek, and both are simple like Persian: same
> order, space as separator.
>
> If you can help me, it would make a great difference in the number of labels in
> your language. Things you can help with:
> 1- Confirm or correct the rules for these languages and add other rules if
> needed.
> 2- Suggest more languages. I thought about Sanskrit, Hindi, and Telugu, but I
> don't know anyone who can check the rules; if you do, please help me.
> 3- For any language I will do an initial run just to test; if you can check
> the bot's edits (which is pretty easy, e.g. see this), it would be awesome.
>
> Thanks,
> Best
>
> _______________________________________________
> Wikidata-l mailing list
> Wikidata-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata-l
>
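The separator and ordering rules listed in the quoted message could be encoded along these lines. This is a sketch under assumptions: the language codes are assumed two-letter codes, the join_name helper is hypothetical, and only two-word names are handled (matching the "I skip them" note for longer names).

```python
# Separator per language, as described in the quoted rules.
# The language codes used as keys are an assumption for illustration.
SEPARATORS = {
    "zh": "·",   # Chinese: middle dot, same order
    "ja": "・",  # Japanese: katakana middle dot, same order
    "fa": " ",   # Persian: space, same order
    "he": " ",   # Hebrew: space, same order
    "el": " ",   # Greek: space, same order
}


def join_name(given: str, family: str, lang: str) -> str:
    """Join already-transliterated given and family names for one language."""
    if lang == "ru":
        # Russian: "FamilyName, GivenName"
        return f"{family}, {given}"
    return SEPARATORS[lang].join([given, family])


print(join_name("艾伦", "图灵", "zh"))
print(join_name("Алан", "Тьюринг", "ru"))
```

Keeping the rules in a small table makes it easy to correct a separator or add a language once someone familiar with it confirms the convention.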




--
Amir