Re: [Wikitech-l] Architectural revisions to improve category sorting

17 Aug 2010

      On 17 August 2010 13:06, Nikola Smolenski smolensk@eunet.rs wrote:
...
On Tuesday 17 August 2010 20:37:44, Aryeh Gregor wrote:
...
The code is currently enabled in trunk and is still awaiting review.
It's basically complete, but there are some issues left:

What sortkey algorithm to use?  Currently it just ASCII uppercases
the words, which is okay for a proof-of-concept but doesn't actually
solve bug 164.
For some time now, I have been thinking about a stupidly simple solution:
php -r 'for ($i = 0; $i < 65536; $i++) { echo pack("nx", $i); echo "\n"; }' |
iconv -f ucs-2be -t utf8 | sort | php -r '$i = 0; foreach (file("php://stdin") as $v)
{ echo var_export(substr($v, 0, -1), true) . " => \"" . str_pad(base_convert($i,
10, 36), 4, "0", STR_PAD_LEFT) . "\",\n"; $i++; }'
This, more or less, should:

1. Print every Unicode (UCS-2 only) character on its own line
2. Sort that according to the current locale
3. Print a PHP array to replace each Unicode character (UTF-8 encoded) with
   the appropriate base36 number
If a UTF-8 string is encoded with this array, the resulting strings should
sort exactly as they do in the locale through mere ASCII sorting. Or am I
missing something big? (Except contextual sensitivity, but it occurs
relatively rarely, and this should still be better than what we have now.)
You are missing most of it :). In many cases a single "letter" is made
up of multiple code points (of which there are considerably more than
65536, by the way); think of Hungarian gy. Then there are all kinds of
conventions for sorting accents: in French you sort á after a, but
only if the rest of the word is spelt the same (i.e. ab < áb < ac).
There is ICU, and it is available to PHP (in some versions):
http://docs.php.net/manual/en/class.collator.php. Using those sort
keys should be "good enough" for now, I imagine. There are languages on
Wiktionary that won't be in ICU yet (just because they are
ludicrously obscure), but it's probably best to start with something
manageable.
Conrad
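
The accent objection above can be reproduced in a few lines. What follows is a minimal Python sketch, not code from the thread: the four-character order in ORDER, the helper names, and the two-level key are all assumptions made for illustration. The two-level key is a simplified stand-in for what a real collator such as ICU does, not an implementation of it.

```python
import unicodedata

A_ACUTE = "\u00e1"  # precomposed "a with acute"

# Hypothetical per-character order (an assumption): one fixed rank per
# character, with a-acute placed directly after plain "a".
ORDER = ["a", A_ACUTE, "b", "c"]
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def base36(n, width=4):
    """Return n as a zero-padded base36 string, like the proposed table."""
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
    return out.rjust(width, "0")


TABLE = {ch: base36(rank) for rank, ch in enumerate(ORDER)}


def per_char_key(s):
    """Proposed scheme: replace each character with its fixed-width rank."""
    s = unicodedata.normalize("NFC", s)
    return "".join(TABLE[ch] for ch in s)


def two_level_key(s):
    """Simplified stand-in for a collator key: base letters are compared
    first, and accents only break ties between otherwise equal words."""
    decomposed = unicodedata.normalize("NFD", s)
    primary = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (primary, decomposed)


words = ["ac", A_ACUTE + "b", "ab"]

# Per-character keys place the accented word after "ac", because the first
# character alone already decides the comparison.
naive = sorted(words, key=per_char_key)

# The two-level key gives the order described in the message: ab, a-acute b, ac.
layered = sorted(words, key=two_level_key)
```

With a single rank per character, the accent on the first letter outweighs everything that follows it, so no choice of table can produce ab < áb < ac together with the other orderings a locale expects. Multi-character letters such as Hungarian gy fail for a similar reason, since the table never sees more than one code point at a time.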
