Simetrical wrote:
On 4/25/07, Neil Harris <usenet(a)tonal.clara.co.uk> wrote:
To reverse the process, first percent-decode the URL as needed, then
decode the resulting UTF-8 byte string into Unicode.
For example,
Fabry-P%C3%A9rot_interferometer
decodes to
Fabry-Pérot interferometer
...since %C3%A9 decodes to the two bytes 0xC3 0xA9, which are the UTF-8
encoding of Unicode code point U+00E9, which represents the character "é".
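
As a rough illustration of the two steps described above (percent-decode
to bytes, then decode the bytes as UTF-8), here is a minimal Python
sketch. The function name title_from_url_fragment is hypothetical and
not part of any library; it assumes the input is the title portion of
the URL only.

```python
from urllib.parse import unquote_to_bytes


def title_from_url_fragment(fragment: str) -> str:
    """Hypothetical helper: turn a percent-encoded title into display text."""
    # Step 1: percent-decode. "%C3%A9" becomes the two bytes 0xC3 0xA9.
    raw = unquote_to_bytes(fragment)
    # Step 2: interpret those bytes as UTF-8 to get Unicode text.
    text = raw.decode("utf-8")
    # Underscores in the URL stand for spaces in the display title.
    return text.replace("_", " ")


if __name__ == "__main__":
    print(title_from_url_fragment("Fabry-P%C3%A9rot_interferometer"))
```

The explicit decode step raises UnicodeDecodeError if the bytes are not
valid UTF-8, which may or may not be what you want for malformed URLs.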
That step is unnecessary if you're using a language like PHP 1-5 that's
encoding-agnostic. The URL will decode to bytes that can be output
directly to a UTF-8-encoded page or stream, where they'll display
correctly. The conversion step is potentially useful only if you use a
language that distinguishes between Unicode and binary strings, and
even there it isn't necessary. The only requirement is that whatever
you pass the result to, or process it with, interprets it as UTF-8,
if that distinction is relevant (which it probably is if the display
name is what's desired).
Basically, yes, it's standard urldecode() followed by replacement of
underscores with spaces.
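
For comparison, the bytes-only approach (urldecode, then replace
underscores, with no Unicode conversion) might look like this in a
language that does distinguish text from bytes. This is only a sketch;
title_bytes is a hypothetical name, and it assumes the consumer of the
output treats the bytes as UTF-8.

```python
from urllib.parse import unquote_to_bytes


def title_bytes(fragment: str) -> bytes:
    """Hypothetical helper: percent-decode and keep the result as raw bytes."""
    # Replacing the byte 0x5F ("_") is safe on UTF-8 data, because bytes
    # inside multi-byte UTF-8 sequences are always >= 0x80.
    return unquote_to_bytes(fragment).replace(b"_", b" ")


if __name__ == "__main__":
    import sys

    # Write the bytes unchanged; they display correctly only if the
    # receiving page or stream is read as UTF-8.
    sys.stdout.buffer.write(title_bytes("Fabry-P%C3%A9rot_interferometer") + b"\n")
```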
...which is completely correct for anyone using UTF-8 as their default
encoding for _everything_, or whose applications are all charset-aware
and can carry the charset information around with their data.
However, for anyone else, the distinction _is_ an important real-world
consideration rather than a pedantic one: if your apps or operating
system use any other default character set or encoding (for example, GB
18030, UTF-16, UCS-4, EUC-CN, EUC-KR, ...), you'll get mojibake rather
than legible article titles.
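
To make the failure mode concrete, the following Python sketch takes
the UTF-8 bytes of the example title, reads them as if they were
GB 18030 (one of the encodings listed above), and then shows the
explicit transcoding that avoids the problem. The variable names are
illustrative only.

```python
# "\u00e9" is the character e-acute, written as an escape so the
# example does not depend on the encoding of this source file.
correct = "Fabry-P\u00e9rot interferometer"

# These are the bytes that percent-decoding the URL produces.
utf8_bytes = correct.encode("utf-8")

# A GB 18030 consumer misreads the 0xC3 0xA9 pair as a different
# character: this is the mojibake.
garbled = utf8_bytes.decode("gb18030", errors="replace")

# Going through Unicode and re-encoding gives bytes that the same
# consumer reads back as the intended title.
transcoded = correct.encode("gb18030")

if __name__ == "__main__":
    print(garbled != correct)
    print(transcoded.decode("gb18030") == correct)
```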
-- Neil