Feature request: Kana entities


Tomasz Wegrzanowski

7 Mar 2002

7:26 p.m.

This feature should be easy to do. Unfortunately my PHP knowledge is limited, so I think it will be better if I just ask for it instead of trying to do it myself :)

Using Japanese characters in non-Japanese Wikipedias is currently hard. One has to write them as &#xHEXCODE; or &#DECIMALCODE;.

I think that it would be much better if the parser were able to parse fake kana &entities; (at least basic kana; full kanji would be much more work) and convert them to numeric codes.

So one can write &hiragana_wa; or &katakana_chi; This isn't likely to conflict with anything.

Kana Unicode table (in "English") is on http://pl.wikipedia.com/wiki.cgi?Kana

Entities that would be needed:
* Full hiragana ぁ to ゔ
* Full katakana ァ to ヺ
* Prolongation mark ー

Proposed names:
* &hiragana_x; &hiragana_smallx;
* &katakana_x; &katakana_smallx;
* &kana_long;

I also think that it might be a good idea to extend it to other writing systems in the future.

Is it possible ?
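
To make the request above concrete, here is a minimal Python sketch of the substitution it describes: named kana entities are looked up in a table and rewritten as decimal numeric references. This is not the wiki's parser code. The table is only a small illustrative subset, and spellings such as "hiragana_smalla" are assumptions derived from the proposed naming pattern.

```python
import re

# Illustrative subset only; a full table would cover hiragana ぁ..ゔ,
# katakana ァ..ヺ and the prolongation mark ー.
# The exact entity spellings here are assumptions, not an agreed standard.
KANA_ENTITIES = {
    "hiragana_wa": "わ",
    "hiragana_smalla": "ぁ",
    "katakana_chi": "チ",
    "katakana_o": "オ",
    "kana_long": "ー",
}

# Names of the form prefix_suffix; ordinary HTML entities such as &amp;
# contain no underscore and therefore never match.
ENTITY_RE = re.compile(r"&([a-z]+_[a-z]+);")


def expand_kana_entities(text):
    """Replace known fake kana entities with decimal numeric references."""
    def repl(match):
        char = KANA_ENTITIES.get(match.group(1))
        # Unknown names are left exactly as written.
        return match.group(0) if char is None else "&#%d;" % ord(char)
    return ENTITY_RE.sub(repl, text)


print(expand_kana_entities("&katakana_chi;&kana_long; &amp; &hiragana_wa;"))
# prints: &#12481;&#12540; &amp; &#12431;
```

Leaving unknown names untouched means that a typo in an entity name shows up literally in the page instead of disappearing silently.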


Jimmy Wales

7 Mar

11:59 p.m.

This is a neat idea, but...

...

So one can write &hiragana_wa; or &katakana_chi; This isn't likely to conflict with anything.

Kana Unicode table (in "English") is on http://pl.wikipedia.com/wiki.cgi?Kana

In general, why would it ever be common that one would wish to put Kana into non-Japanese wikipedias?

At least at first thought (and I'm totally open to have my mind changed!) the only place in an English or Polish wikipedia where showing Kana would make sense would be an article _about_ Kana.

--Jimbo

Tomasz Wegrzanowski

8 Mar

6:07 a.m.

On Thu, Mar 07, 2002 at 03:59:34PM -0800, Jimmy Wales wrote:

...

This is a neat idea, but...

...
So one can write &hiragana_wa; or &katakana_chi; This isn't likely to conflict with anything.

Kana Unicode table (in "English") is on http://pl.wikipedia.com/wiki.cgi?Kana

In general, why would it ever be common that one would wish to put Kana into non-Japanese wikipedias?

At least at first thought (and I'm totally open to have my mind changed!) the only place in an English or Polish wikipedia where showing Kana would make sense would be an article _about_ Kana.

--Jimbo

Just see articles about anything Japanese on English Wikipedia. They contain Japanese names of everything.

Brion L. VIBBER

7:34 a.m.

On ĵaŭ, 2002-03-07 at 22:07, Tomasz Wegrzanowski wrote:

...

On Thu, Mar 07, 2002 at 03:59:34PM -0800, Jimmy Wales wrote:

...
This is a neat idea, but...

...
So one can write &hiragana_wa; or &katakana_chi; This isn't likely to conflict with anything.

Kana Unicode table (in "English") is on http://pl.wikipedia.com/wiki.cgi?Kana

In general, why would it ever be common that one would wish to put Kana into non-Japanese wikipedias?

At least at first thought (and I'm totally open to have my mind changed!) the only place in an English or Polish wikipedia where showing Kana would make sense would be an article _about_ Kana.

Just see articles about anything Japanese on English Wikipedia. They contain Japanese names of everything.

Sure, but more often kanji than kana, so special kana markup wouldn't be that big a win. See the thread "International Upgrades"; the vague plan is to standardise the internal character set and present the wikipedias in Unicode to capable browsers. (Please comment!)

As a result, we should be able to use the customary input methods or cut-n-paste to put any characters into any of the wikis, which is certainly a lot easier than looking up entities or running text through a UTF-8-to-entities convertor (which is what I currently do).

-- brion vibber (brion @ pobox.com)

Tomasz Wegrzanowski

12:07 p.m.

On Thu, Mar 07, 2002 at 11:34:53PM -0800, Brion L. VIBBER wrote:

...

On ĵaŭ, 2002-03-07 at 22:07, Tomasz Wegrzanowski wrote:

...
Just see articles about anything Japanese on English Wikipedia. They contain Japanese names of everything.

Sure, but more often kanji than kana, so special kana markup wouldn't be that big a win. See the thread "International Upgrades"; the vague plan is to standardise the internal character set and present the wikipedias in Unicode to capable browsers. (Please comment!)

Uhm, right. But most non-Japanese people don't know the names of many kanji, so kanji aren't that important. ;) On the other hand, more people than is usually thought know kana, so it might be beneficial for them.

Hmmm. Now I think that some general method would be more useful: &katakana_a; &kanji_b; &hebrew_c; or &cyrilic_d;

I think that it won't need too many changes in the parser. Perl code:

Init:

%Entities = ('&katakana_o;' => 'オ', ... );

On HTML output:

s/(&[a-zA-Z0-9_]+;)/exists $Entities{$1} ? $Entities{$1} : $1/eg;

...

As a result, we should be able to use the customary input methods or cut-n-paste to put any characters into any of the wikis, which is certainly a lot easier than looking up entities or running text through a UTF-8-to-entities convertor (which is what I currently do).

-- brion vibber (brion @ pobox.com)

Hmmm. Wouldn't that need some modifications to browsers ?

Brion L. VIBBER

6:35 p.m.

On ven, 2002-03-08 at 04:07, Tomasz Wegrzanowski wrote:

...

On Thu, Mar 07, 2002 at 11:34:53PM -0800, Brion L. VIBBER wrote:

...
On ĵaŭ, 2002-03-07 at 22:07, Tomasz Wegrzanowski wrote:

...
Just see articles about anything Japanese on English Wikipedia. They contain Japanese names of everything.

Sure, but more often kanji than kana, so special kana markup wouldn't be that big a win. See the thread "International Upgrades"; the vague plan is to standardise the internal character set and present the wikipedias in Unicode to capable browsers. (Please comment!)

Uhm, right. But most non-Japanese people don't know the names of many kanji, so kanji aren't that important. ;) On the other hand, more people than is usually thought know kana, so it might be beneficial for them.

But, what are people who don't know much Japanese going to _do_ with kana?

Speaking as someone with a very very poor command of the Japanese language, my own usage of Japanese characters on the non-Japanese wikipedias is limited to:
* Demonstration of Japanese characters in articles about the language
* Showing the local form of a place, personal, or other name in articles about Japan and Japanese culture

The former are a limited genre (Jimbo's "special case"), and the latter are overwhelmingly kanji.

...

Hmmm. Now I think that some general method would be more useful: &katakana_a; &kanji_b; &hebrew_c; or &cyrilic_d;

Hmm. Perhaps you should take this up with the w3 and get these put into the next XHTML standard. :)

...

...
As a result, we should be able to use the customary input methods or cut-n-paste to put any characters into any of the wikis, which is certainly a lot easier than looking up entities or running text through a UTF-8-to-entities convertor (which is what I currently do).

Hmmm. Wouldn't that need some modifications to browsers ?

Only if you've got a really limited browser. (Perhaps Netscape 4, the bane of web developers worldwide, or a text-mode browser in a non UTF-8 locale.)

Mozilla/Netscape 6, Internet Explorer 5+, Konqueror (if fonts are set up right), you should have no problem. Configuring keyboards/input methods, of course, is a system-dependent matter. (Japanese input is notoriously difficult to set up on Unixish systems that aren't running a primarily Japanese locale; it's quite easy on relatively current Mac or Windows systems, though.)

-- brion vibber (brion @ pobox.com)

Jimmy Wales

5:20 p.m.

Brion L. VIBBER wrote:

...

Sure, but more often kanji than kana, so special kana markup wouldn't be that big a win. See the thread "International Upgrades"; the vague plan is to standardise the internal character set and present the wikipedias in Unicode to capable browsers. (Please comment!)

Really? There are kanji in articles about Japan? I mean, articles other than articles about the language or other special cases?

That seems odd to me. I'm not opposed to it, necessarily, but it seems very odd. I mean, there's no reason to expect that kanji will be useful to the vast majority of readers.

Can you send some examples?

--Jimbo

Tomasz Wegrzanowski

6:06 p.m.

On Fri, Mar 08, 2002 at 09:20:16AM -0800, Jimmy Wales wrote:

...

Brion L. VIBBER wrote:

...
Sure, but more often kanji than kana, so special kana markup wouldn't be that big a win. See the thread "International Upgrades"; the vague plan is to standardise the internal character set and present the wikipedias in Unicode to capable browsers. (Please comment!)

Really? There are kanji in articles about Japan? I mean, articles other than articles about the language or other special cases?

That seems odd to me. I'm not opposed to it, necessarily, but it seems very odd. I mean, there's no reason to expect that kanji will be useful to the vast majority of readers.

Can you send some examples?

Murasaki Shikibu, Anime, Princess Mononoke, Neon Genesis Evangelion, Miyazaki Hayao (and a lot more)

In fact, the majority of Japan-related articles have some kanji.

Brion L. VIBBER

6:07 p.m.

On ven, 2002-03-08 at 09:20, Jimmy Wales wrote:

...

Brion L. VIBBER wrote:

...
Sure, but more often kanji than kana, so special kana markup wouldn't be that big a win. See the thread "International Upgrades"; the vague plan is to standardise the internal character set and present the wikipedias in Unicode to capable browsers. (Please comment!)

Really? There are kanji in articles about Japan?

Yeeessss.... You have such a difficult time accepting this. :)

...

I mean, articles other than articles about the language or other special cases?

Define "special cases".

...

That seems odd to me. I'm not opposed to it, necessarily, but it seems very odd. I mean, there's no reason to expect that kanji will be useful to the vast majority of readers.

No, but there's no reason to expect that any particular *article* will be useful to the vast majority of readers for that matter.

A few question marks or boxes in parentheses aren't going to drive non-Japanese readers mad with one look, but for those who *do* know it, they *do* get more information because they now can recognize the term in Japanese text, or usefully look it up in Japanese informational resources.

The English wikipedia isn't just for English monolinguals, is it?

...

Can you send some examples?

...

From a quick search...

Nagano, Japan Japan/Meiji Emperor Akihito of Japan Emperor Jimmu of Japan Satsuma Okinawa Hideki Tojo Tokyo Meiji-era leaders Shogun Koto Dejima Yen Tokugawa shoguns Kamakura shoguns Hanko Akihabara Samurai Cyprinus carpio Nintendo Nissan Ashikaga shoguns Kyoto Morihei Ueshiba Toyotomi Hideyoshi Jokichi Takamine Ju-jitsu Junichiro Koizumi World War II/Hiryu World War II/Kaga Iron Chef World War II/Soryu Akira Kurosawa Kamikaze World War II/Zuikaku Heisuke Hironaka Amakusa Tsurugi Raku Karaoke Kaifu Toshiki Toyota Tsunami Shibasaburo Kitasato Judo Gomoku Suzuki Kendo Zhu Shijie Isoroku Yamamoto Miyazaki Hayao Sushi Choshu Anime Otaku No Video The Vision of Escaflowne Kia Asayama Ghost in the Shell Tenchi Muyo Star Blazers Princess Mononoke Doraemon Hentai Masamune Shirow Manga My Neighbor Totoro Trigun Sailor Moon Ranma 1/2 Rumiko Takahashi

I'm sure I missed plenty. These can be broadly categorized as:
* Geographical names, with the local native kanji "spelling" as a sidenote
* Personal names of politicians, scientists, and artists, with their native kanji "spelling" as a sidenote
* Various cultural items originating in Japan (sushi, karaoke, martial arts, companies, works of art/pop culture) with their native kanji/kana "spelling" as a sidenote

That said, I'm still not convinced there's much usefulness in more special codes that work on our wiki and nowhere else in the world; only a fraction of the above use kana at all.

-- brion vibber (brion @ pobox.com)

Tomasz Wegrzanowski

6:26 p.m.

On Fri, Mar 08, 2002 at 10:07:58AM -0800, Brion L. VIBBER wrote:

...

That said, I'm still not convinced there's much usefulness in more special codes that work on our wiki and nowhere else in the world; only a fraction of the above use kana at all.

*Anything* is better than using numerics. Many of them are mixed kana + kanji. That still saves half of the work.

Brion L. VIBBER

6:45 p.m.

On ven, 2002-03-08 at 10:26, Tomasz Wegrzanowski wrote:

...

On Fri, Mar 08, 2002 at 10:07:58AM -0800, Brion L. VIBBER wrote:

...
That said, I'm still not convinced there's much usefulness in more special codes that work on our wiki and nowhere else in the world; only a fraction of the above use kana at all.

*Anything* is better than using numerics. Many of them are mixed kana + kanji. That still saves half of the work.

What are you doing, looking up every character individually? No wonder you're having trouble!

What I currently do is to type the desired text into yudit (http://yudit.org) using its support for the kinput2 input method (or cut-n-paste into yudit from another web page), save the file, and run it through this little program:

#!/usr/bin/perl -p
# disassemble non-ASCII codes from UTF-8 stream

# borrowed from http://czyborra.com/utf/

#$format=$ENV{"UCFORMAT"}||'<U%04X>';
$format='&#%d;';
s/([\xC0-\xDF])([\x80-\xBF])/sprintf($format,
    unpack("c",$1)<<6&0x07C0|unpack("c",$2)&0x003F)/ge;
s/([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])/sprintf($format,
    unpack("c",$1)<<12&0xF000|unpack("c",$2)<<6&0x0FC0|unpack("c",$3)&0x003F)/ge;
s/([\xF0-\xF7])([\x80-\xBF])([\x80-\xBF])([\x80-\xBF])/sprintf($format,
    unpack("c",$1)<<18&0x1C0000|unpack("c",$2)<<12&0x3F000|
    unpack("c",$3)<<6&0x0FC0|unpack("c",$4)&0x003F)/ge;

Paste the output into the Wikipedia edit box, and presto!
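
For readers who find the bit masks in the script above hard to follow, here is a rough Python sketch of the same idea: take the payload bits from the lead byte, append six bits from each continuation byte, and print the resulting code point as a decimal reference. It is a re-expression for illustration, not a port of the script. The function name is made up, and it does not validate truncated or overlong input.

```python
def utf8_to_entities(data):
    """Return ASCII text with each non-ASCII UTF-8 sequence written as &#N;."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:               # ASCII byte: copy unchanged
            out.append(chr(lead))
            i += 1
            continue
        if 0xC0 <= lead <= 0xDF:      # 2-byte sequence: 5 payload bits in the lead
            size, code = 2, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:    # 3-byte sequence: 4 payload bits in the lead
            size, code = 3, lead & 0x0F
        elif 0xF0 <= lead <= 0xF7:    # 4-byte sequence: 3 payload bits in the lead
            size, code = 4, lead & 0x07
        else:
            raise ValueError("unexpected byte 0x%02X at offset %d" % (lead, i))
        for cont in data[i + 1:i + size]:
            # Each continuation byte contributes its low 6 bits.
            code = (code << 6) | (cont & 0x3F)
        out.append("&#%d;" % code)
        i += size
    return "".join(out)


print(utf8_to_entities("Sushi (すし)".encode("utf-8")))
# prints: Sushi (&#12377;&#12375;)
```

Worked by hand for あ (bytes E3 81 82): 0x3 from the lead byte, then 0x01 and 0x02 from the continuation bytes, gives 0x3042, i.e. &#12354;.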

If I have one name, it gets done at once. If I have two names, they get done at once. If I put in a whole passage of text, it all gets done at once. It would actually be *more* work for me to separately write out the kana characters in special codes.

Once we've got the new system with Unicode up, you should be able to type or paste the characters in directly (unless you have a very limited browser, see my earlier post) and bypass all this rigamarole.

-- brion vibber (brion @ pobox.com)

Lars Aronsson

10 Mar

12:34 p.m.

Brion L. Vibber wrote:

...

On ven, 2002-03-08 at 09:20, Jimmy Wales wrote:

...
Really? There are kanji in articles about Japan?

Yeeessss.... You have such a difficult time accepting this. :)

For the matter of implementing the search engine, Latin search and Kanji search could be two different functions. Just like image search (Google style) is a third function and mathematic equations search could be a fourth kind of search, once LaTeX support is integrated in Wikipedia. To me, the kanji is just like images and I have no keyboard to input that in the search window anyway.

The English Wikipedia might implement all three searches, but the Norwegian and German ones might only need the Latin search (until a significant number of German Wikipedia pages have kanji, images or equations in them).

Perhaps this separation of implementations can help us get forward? I still have no advice for the non-Latin Wikipediae.

...

The English wikipedia isn't just for English monolinguals, is it?

Is this the new politically correct term for Americans? :-)

-- Lars Aronsson (lars@aronsson.se) Aronsson Datateknik Teknikringen 1e, SE-583 30 Linuxköping, Sweden tel +46-70-7891609 http://aronsson.se/ http://elektrosmog.nu/ http://susning.nu/

Brion L. VIBBER

4:44 p.m.

On dim, 2002-03-10 at 04:34, Lars Aronsson wrote:

...

Brion L. Vibber wrote:

...
On ven, 2002-03-08 at 09:20, Jimmy Wales wrote:

...
Really? There are kanji in articles about Japan?

Yeeessss.... You have such a difficult time accepting this. :)

For the matter of implementing the search engine, Latin search and Kanji search could be two different functions. Just like image search (Google style) is a third function and mathematic equations search could be a fourth kind of search, once LaTeX support is integrated in Wikipedia. To me, the kanji is just like images and I have no keyboard to input that in the search window anyway.

The English Wikipedia might implement all three searches, but the Norwegian and German ones might only need the Latin search (until a significant number of German Wikipedia pages have kanji, images or equations in them).

Perhaps this separation of implementations can help us get forward? I still have no advice for the non-Latin Wikipediae.

Well, my point is that there is no need whatsoever to make these separate functions. They work 100% THE SAME WAY. You put in text, it munges it a bit to make non-ASCII characters behave, and searches for it. No images involved.

No great hardship to the person who never types a kanji. (Or an o-with-umlaut!)

No reason to make them separate.
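
A small Python sketch of the point being made: if both the stored text and the query are put through the same normalization, one search function serves Latin, accented, and Japanese text alike. What the wiki's search actually does when it "munges" text is not spelled out in this thread. The normalization chosen here (decoding character references, then case folding) and the function names are assumptions for illustration only.

```python
import html


def normalize(text):
    """Decode character references and fold case so both sides compare equal."""
    return html.unescape(text).casefold()


def matches(query, article_text):
    """Plain substring search over the normalized forms."""
    return normalize(query) in normalize(article_text)


article = "Sushi (&#12377;&#12375;) is served in K&#246;ln too."

print(matches("sushi", article))   # Latin query
print(matches("すし", article))     # kana query, same code path
print(matches("köln", article))    # o-with-umlaut, same code path
```

All three queries go through the same two functions; nothing in the search itself needs to know which script the query is written in.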

...

...
The English wikipedia isn't just for English monolinguals, is it?

Is this the new politically correct term for Americans? :-)

-- brion vibber (brion @ pobox.com)

Jimmy Wales

8 Mar

5:18 p.m.

I wrote:

...

...
In general, why would it ever be common that one would wish to put Kana into non-Japanese wikipedias?

At least at first thought (and I'm totally open to have my mind changed!) the only place in an English or Polish wikipedia where showing Kana would make sense would be an article _about_ Kana.

Tomasz Wegrzanowski wrote:

...

Just see articles about anything Japanese on English Wikipedia. They contain Japanese names of everything.

But shouldn't these Japanese names generally be written in the Roman alphabet (Romaji), not in Kana? If I open up an Encyclopedia Britannica article about 'anime' or 'sushi' or 'Hirohito' or 'Konoe Fumimaro' I don't expect to see kana, but Romaji.

I'm not a real stickler on this point; as I say, I could be convinced. I'm just saying that it strikes me as fairly odd to put Kana or Kanji character sets into other languages, except in some very special cases.

--Jimbo

Brion L. VIBBER

5:47 p.m.

On ven, 2002-03-08 at 09:18, Jimmy Wales wrote:

...

Tomasz Wegrzanowski wrote:

...
Just see articles about anything Japanese on English Wikipedia. They contain Japanese names of everything.

But shouldn't these Japanese names generally be written in the Roman alphabet (Romaji), not in Kana? If I open up an Encyclopedia Britannica article about 'anime' or 'sushi' or 'Hirohito' or 'Konoe Fumimaro' I don't expect to see kana, but Romaji.

Bring up the wikipedia article on [[Miyazaki Hayao]] (or, for that matter, [[Sushi]]) for an example of what we're talking about. Kanji/kana are provided as supplementary parenthetical information, while the main text uses the English name and, if different, the Romaji form.

...

I'm not a real stickler on this point; as I say, I could be convinced. I'm just saying that it strikes me as fairly odd to put Kana or Kanji character sets into other languages, except in some very special cases.

What other special case could there be than "something originating in culture X, here's its real name in the language of X in case you can read X and want to look up more information or, heck, are just curious".

-- brion vibber (brion @ pobox.com)

Jimmy Wales

5:45 p.m.

O.k., I'm starting to see the light on this.

Brion L. VIBBER wrote:

...

...
I'm not a real stickler on this point; as I say, I could be convinced. I'm just saying that it strikes me as fairly odd to put Kana or Kanji character sets into other languages, except in some very special cases.

What other special case could there be than "something originating in culture X, here's its real name in the language of X in case you can read X and want to look up more information or, heck, are just curious".

Oh, the special cases I had in mind were articles _about_ kana or kanji, or in cases where the kana or kanji are likely to be well-known in a certain context.

I don't know much about 'anime', for example, but I imagine that fans of anime are familiar with the kana for 'a ni me'. New people interested in that area may have seen those kana around but not yet grasped what they mean. So they'd be excited to find out by reading our 'a ni me' article.

That's different from just sticking a kanji in after someone's name.

But, now I'm starting to see the light. So long as this is just presented as parenthetical information, there's no harm and it could be very useful. Take the Sushi article as an example. Someone could use it to become familiar with the kanji for different things like nigirisushi, and then have more fun the next time at a Japanese restaurant.

Boku wa nihongo no gakusei. De mo watashi no nihongo joozu de wa arimasen. "I am a Japanese language student. But, my Japanese language is not proficient."

So consider me totally converted on this point.

(However, my Konqueror browser does not render these characters at all.)

--Jimbo


wikitech-l@lists.wikimedia.org
