Re: [Wikitech-l] Mapping Hiragana and Katakana

21 Sep 2017


      ...
Well, I would expect "phonetic:" would bind with something like IPA, but
the concept of keyword is interesting.
Finding good names for keywords is also an art. "phonetic:" came to mind
because the algorithms used to index words by pronunciation are
collectively called phonetic algorithms[1]. You could conceivably also map
to IPA, but in general the algorithms are much less detailed than IPA,
because they are trying to find a balance between inclusivity and
exclusivity in grouping similar words (a lot drop non-initial vowels, for
example), while IPA is usually much more specific.
Mapping IPA into such a system would be interesting. Say I heard someone
talk about someone named /ɡədɑfi/—hopefully that would allow me to find
Gaddafi (with his famously hard-to-spell name). Amusingly, the character
folding on English Wikipedia maps ɡədɑfi to gadafi, which is a redirect to
Gaddafi—so IPA sometimes works now! But it wouldn't work for George /kluni/.
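
A minimal sketch of why folding helps in one case and not the other. The table below is hypothetical: its three entries were chosen only to reproduce the ɡədɑfi to gadafi result described above, and they are not taken from the actual folding configuration on English Wikipedia.

```python
# Hypothetical per-character folding table (assumption, for illustration).
IPA_FOLD = {
    "\u0261": "g",  # LATIN SMALL LETTER SCRIPT G
    "\u0259": "a",  # LATIN SMALL LETTER SCHWA
    "\u0251": "a",  # LATIN SMALL LETTER ALPHA
}


def fold(text: str) -> str:
    # Replace each character that has a table entry; keep everything else.
    return "".join(IPA_FOLD.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(fold("\u0261\u0259d\u0251fi"))  # the IPA string for Gaddafi
    print(fold("kluni"))                  # already plain Latin letters
```

Folding works character by character, so it can only help when the folded IPA happens to look like a known spelling. The string kluni is unchanged by folding and still does not resemble the spelling Clooney.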
We have gone far afield now, but there are Phabricator tickets for advanced
search in general[2] and phonetic search specifically[3] if anyone wants to
follow up there.
[1] https://en.wikipedia.org/wiki/Phonetic_algorithm
[2] https://phabricator.wikimedia.org/T174064
[3] https://phabricator.wikimedia.org/T174705
On Thu, Sep 21, 2017 at 6:50 AM, mathieu stumpf guntz <
psychoslave@culture-libre.org> wrote:
...
On 20/09/2017 at 03:40, Trey Jones wrote:
Anyway, would it be a big deal to show the transliterated results with
less weight in ranking?
Doing any special weighting would be more difficult, but they would
already be naturally ranked lower for not being exact matches. (You can see
this at work if you compare the results for *resume, resumé,* and *résumé*
on English Wikipedia, for example.)
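
The "naturally ranked lower" effect can be sketched as follows. This is a deliberately simplified model under stated assumptions: the two-level score and the function names are hypothetical, and real search ranking involves many more signals.

```python
import unicodedata


def strip_accents(text: str) -> str:
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def score(query: str, title: str) -> int:
    # Hypothetical scoring: an exact match beats a match found only after folding.
    if query == title:
        return 2
    if strip_accents(query) == strip_accents(title):
        return 1
    return 0


def rank(query: str, titles: list[str]) -> list[str]:
    matches = [t for t in titles if score(query, t) > 0]
    # sorted() is stable, so equally scored titles keep their input order.
    return sorted(matches, key=lambda t: -score(query, t))


if __name__ == "__main__":
    titles = ["resume", "r\u00e9sum\u00e9", "resum\u00e9", "wolf"]
    print(rank("r\u00e9sum\u00e9", titles))
```

All three spellings match the query, but the exact spelling comes first without any special weighting being configured.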
Interesting to know. Thank you.
Actually, add an option button in advanced search in any case, and just
limit the discussion to whether it should be opt-in or opt-out.
There are longer term plans for revamping advanced search capabilities, so
if we want to go that route, it's doable, but it would definitely be on
hold for a while. Options that have been mentioned include a special case
keyword like "kana:オオカミ", or a more generic keyword like "phonetic:オオカミ"
that was smart enough to know what to do with kana, but might do something
different with other characters... but that's all at the vague ideation
stage right now.
Well, I would expect "phonetic:" would bind with something like IPA, but
the concept of keyword is interesting.
Thanks!
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
On Tue, Sep 19, 2017 at 8:29 PM, mathieu stumpf guntz <
psychoslave@culture-libre.org> wrote:
...
On 19/09/2017 at 23:47, Trey Jones wrote:
We recently got a suggestion via Phabricator[1] to automatically map
between hiragana and katakana when searching on English Wikipedia and other
wiki projects. As an always-on feature, this isn't difficult to implement,
but major commercial search engines (Google.jp, Bing, Yahoo Japan,
DuckDuckGo, Goo) don't do that. They give different results when searching
for hiragana/katakana forms (for example, オオカミ/おおかみ "wolf"). They also give
different *numbers* of results, seeming to indicate that it's not just
re-ordering the same results (say, so that results in the same script are
ranked higher).[2] I want to know what they know that I don't!
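
For reference, the mapping itself is simple because the two Unicode blocks are laid out in parallel: katakana U+30A1 to U+30F6 sit exactly 0x60 above hiragana U+3041 to U+3096. The sketch below relies on that offset; it is only an illustration and not how any search engine or wiki analyzer is known to implement it.

```python
KANA_OFFSET = 0x60  # distance between the katakana and hiragana blocks


def katakana_to_hiragana(text: str) -> str:
    out = []
    for ch in text:
        # Only the range with direct hiragana counterparts is shifted;
        # other characters (kanji, Latin, the long-vowel mark) pass through.
        if "\u30a1" <= ch <= "\u30f6":
            out.append(chr(ord(ch) - KANA_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(katakana_to_hiragana("オオカミ"))
```

Indexing and querying through the same function would make オオカミ and おおかみ match each other, which is the always-on behaviour being discussed.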
Does anyone have any thoughts on whether this would be useful (seems that
it would) and whether it would cause any problems (it must, or otherwise
all the other search engines would do it, right?).
Well, maybe. Or not. Look at how DuckDuckGo continues to offer only a
"country" option to filter *languages*. The two might be complementary,
but personally I'm generally more interested in the latter, all the more
so when I'm using a language that no country uses as an official language.
:)
Anyway, would it be a big deal to show the transliterated results with
less weight in ranking? Actually, add an option button in advanced search
in any case, and just limit the discussion to whether it should be opt-in
or opt-out.
Any idea why it might be different between a Japanese-language wiki and a
non-Japanese-language wiki? We often are more aggressive in matching
between characters that are not native to a given language--for example,
accents on Latin characters are generally ignored on English-language
wikis. So it might make sense to merge hiragana and katakana on
English-language wikis but not Japanese-language wikis.
Thanks very much for any suggestions or information!
—Trey
どういたしました。
[1] https://phabricator.wikimedia.org/T176197
[2] Details of my tests at https://phabricator.wikimedia.org/T173650#3580309
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
