Good day,
This is the weekly update from the Search Platform team for the week starting 2018-11-12.
Programming note: Given the upcoming US holiday, the next update will be for the week starting 2018-11-26.
As always, feedback and questions welcome.
== Discussions ==
=== Search ===
* David and Trey have resolved the problems with 32-bit Chinese characters (like 𨨏 [0]). Searches for these characters were returning irrelevant results and showing lots of Unicode replacement characters (�) in the results. The highlighter fix was deployed [1] first, so there are no more � characters in the results. The re-indexing [2] to improve the relevance of results is now also done for Chinese-language wikis.
== Did you know? ==
* Letters are encoded internally by computers as numbers: for example, “A” is 65 and “a” is 97.[3] Years ago, programs and even websites would use different encodings[4] to represent text, often leading to unreadable gibberish on screen. Unicode[5] was intended to be a single encoding for most of the world’s writing systems. The most-used parts of it fit into a 16-bit representation,[6] which can handle about 65 thousand characters. But that's not enough for the very large number of rare and historical Chinese, Japanese, and Korean (CJK) characters, which are represented in 16-bit Unicode using “surrogate pairs”.[7] 1,024 Unicode characters are set aside to be “high surrogates” (the first half of a 32-bit character) and 1,024 characters are set aside to be “low surrogates” (the second half). By themselves, the surrogates aren’t valid and don’t represent anything, but in pairs they can represent over a million additional characters (1,024 × 1,024 = 1,048,576). Since these characters are usually rare, software can sometimes treat them incorrectly and split them up, which can result in you seeing the Unicode replacement character �,[8] which is used when something has gone wrong processing Unicode text. (When the character is fine, but you don’t have a font to show it, you sometimes get little squares instead. Since the most common source of these squares for English speakers is unrepresented CJK characters, a slang term for the squares is “tofu”.[9])
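For readers who like to see the numbers, here is a minimal Python sketch of the ideas above. The helper names to_surrogate_pair and split_in_the_middle are made up for illustration, and U+20000 (the first character of CJK Extension B) is just a convenient example. This is not how the search highlighter itself is implemented; it only mimics the kind of mistake that produces a � character.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000       # 20 bits of information remain
    high = 0xD800 + (offset >> 10)      # top 10 bits pick 1 of 1,024 high surrogates
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits pick 1 of 1,024 low surrogates
    return high, low


def split_in_the_middle(text: str) -> str:
    """Mimic buggy software that cuts UTF-16 text between two surrogates."""
    units = text.encode("utf-16-be")    # two bytes per 16-bit code unit
    truncated = units[:-2]              # drop the last code unit only
    # Hypothetical recovery step: decode anyway, replacing whatever is broken.
    return truncated.decode("utf-16-be", errors="replace")


# Letters are numbers.
print(ord("A"), ord("a"))               # 65 97

# The same bytes read with the wrong encoding turn into gibberish.
print("é".encode("utf-8").decode("latin-1"))   # Ã©

# One character outside the 16-bit range becomes two 16-bit surrogates.
high, low = to_surrogate_pair(0x20000)
print(hex(high), hex(low))              # 0xd840 0xdc00

# Cutting the pair in half leaves an invalid lone surrogate behind.
print(split_in_the_middle("x" + chr(0x20000)))
```

The last line prints an "x" followed by the replacement character, because the lone high surrogate left behind by the cut is not valid on its own.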
[0] https://phabricator.wikimedia.org/T168427
[1] https://phabricator.wikimedia.org/T209293
[2] https://phabricator.wikimedia.org/T209156
[3] https://en.wikipedia.org/wiki/ASCII#Printable_characters
[4] https://en.wikipedia.org/wiki/Character_encoding#Common_character_encodings
[5] https://en.wikipedia.org/wiki/Unicode
[6] https://en.wikipedia.org/wiki/UTF-16
[7] https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates
[8] https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character
[9] https://en.wiktionary.org/wiki/tofu#Noun
----
Subscribe to receive on-wiki (or opt-in email) notifications of the Discovery weekly update.
https://www.mediawiki.org/wiki/Newsletter:Discovery_Weekly
The archive of all past updates can be found on MediaWiki.org:
https://www.mediawiki.org/wiki/Discovery/Status_updates
Interested in getting involved? See tasks marked as "Easy" or "Volunteer needed" in Phabricator.
[1] https://phabricator.wikimedia.org/maniphest/query/qW51XhCCd8.7/#R
[2] https://phabricator.wikimedia.org/maniphest/query/5KEPuEJh9TPS/#R
Many thanks,
Chris Koerner
Community Relations Specialist
Wikimedia Foundation
Hi Chris,
On 20 November 2018 03:39:02 GMT+05:30, Chris Koerner <ckoerner@wikimedia.org> wrote:
== Did you know? ==
Thanks for the informative "Did you know?" section. It was an interesting read. :-)
I also always really enjoy these, thanks!🐙
On Wed, 21 Nov 2018 at 04:41, Kaartic Sivaraam <kaarticsivaraam91196@gmail.com> wrote:
-- Sivaraam
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l