RE: Static html - Wikitech-l

29 May 2003


Hi Alfio,
I looked at your code. Nice job.
Superficially it may seem we did almost the same job, but the overlap
is minimal. My Perl script addresses a lot of issues that are only
relevant in a Palm/Pocket PC/TomeRaider environment.
Your version has quite a lot of code that is specific to a static HTML
version.
Still, there are some areas where we can help each other.
You mentioned Unicode support as an open issue. Coincidentally, I was
looking into this issue over the past few days while preparing a
TomeRaider version of the Esperanto Wikipedia, which would be unreadable
without it.
Here is some Perl code to translate UTF-8 multi-byte sequences into
HTML numeric character references of the form &#nnnn;
The UTF-8 coding scheme on which this is based follows after the code.
# unicode -> html character codes &#nnnn;
$entry =~ s/([\x80-\xFF]+)/&UnicodeToHtml($1)/ge ;
sub UnicodeToHtml
{
  my $text  = shift ;
  my $html = "" ;
  my ($c, $byte, $ord, $unicode, $bytes) ; # parentheses needed to declare the whole list
  for ($c = 0 ; $c < length ($text) ; $c++)
  {
    $byte = substr ($text,$c,1) ; # optimize with regexp ?
    $ord = ord ($byte) ;
    if ($ord < 128)      # plain ascii character
    { $html .= $byte ; } # (will not occur in this script)
    else
    {
      if ($ord < 224)
      { $bytes = 2 ; }
      elsif ($ord < 240)
      { $bytes = 3 ; }
      elsif ($ord < 248)
      { $bytes = 4 ; }
      elsif ($ord < 252)
      { $bytes = 5 ; }
      else
      { $bytes = 6 ; }
      $unicode = substr ($text,$c,$bytes) ;
      $html .= &UnicodeToHtmlTag ($unicode) ;
      $c += $bytes - 1 ;
    }
  }
  return ($html) ;
}
sub UnicodeToHtmlTag
{
  my $unicode = shift ;
  my $char = substr ($unicode,0,1) ;
  my $ord = ord ($char) ;
  my ($c, $value) ;       # $ord is already declared above
  if ($ord < 128)         # plain ascii character
  { return ($unicode) ; } # (will not occur in this script)
  else
  {
    if ($ord >= 252)
    { $value = $ord - 252 ; }
    elsif ($ord >= 248)
    { $value = $ord - 248 ; }
    elsif ($ord >= 240)
    { $value = $ord - 240 ; }
    elsif ($ord >= 224)
    { $value = $ord - 224 ; } # 1110xxxx: strip the 0xE0 marker bits
    else
    { $value = $ord - 192 ; }
    for ($c = 1 ; $c < length ($unicode) ; $c++)
    { $value = $value * 64 + ord (substr ($unicode, $c,1)) - 128 ; }
    return ("&#" . $value . ";") ;
  }
}
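The comment in the loop above asks whether the byte-by-byte scan could be
replaced by a regexp. A minimal sketch of that idea follows. It assumes the
input is well-formed UTF-8 held as raw bytes, and it only covers sequences of
up to four bytes. The names Utf8ToEntities and SequenceToEntity are
hypothetical and not part of the script above.

```perl
use strict ;
use warnings ;

# Hypothetical helper: replace each UTF-8 multi-byte sequence by &#nnnn;
# Assumes $text holds raw bytes (not decoded characters) and is well-formed.
sub Utf8ToEntities
{
  my $text = shift ;
  # one lead byte (0xC0-0xF7) followed by its continuation bytes (0x80-0xBF)
  $text =~ s/([\xC0-\xF7][\x80-\xBF]+)/&SequenceToEntity($1)/ge ;
  return ($text) ;
}

sub SequenceToEntity
{
  my @bytes = unpack ("C*", shift) ;
  my $lead  = shift (@bytes) ;
  # keep only the payload bits of the lead byte
  my $value = $lead < 224 ? $lead & 0x1F    # 110xxxxx
            : $lead < 240 ? $lead & 0x0F    # 1110xxxx
            :               $lead & 0x07 ;  # 11110xxx
  foreach my $byte (@bytes)
  { $value = $value * 64 + ($byte & 0x3F) ; } # 10xxxxxx adds 6 bits
  return ("&#" . $value . ";") ;
}

# Esperanto c-circumflex (U+0109, bytes C4 89) followed by "iu"
print Utf8ToEntities ("\xC4\x89iu") . "\n" ;   # prints &#265;iu
```

Matching a whole sequence per substitution avoids the manual index
bookkeeping, at the cost of trusting the input to be valid.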
Found this somewhere on the web:
#UTF-8 works as follows:
#ENCODING
#       The following byte sequences are used to represent a char-
#       acter. The sequence to be used depends  on  the  UCS  code
#       number of the character:
#       0x00000000 - 0x0000007F:
#           0xxxxxxx
#
#       0x00000080 - 0x000007FF:
#           110xxxxx 10xxxxxx
#
#       0x00000800 - 0x0000FFFF:
#           1110xxxx 10xxxxxx 10xxxxxx
#
#       0x00010000 - 0x001FFFFF:
#           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
#
#       0x00200000 - 0x03FFFFFF:
#           111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
#
#       0x04000000 - 0x7FFFFFFF:
#           1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
#
#       The  xxx  bit  positions  are  filled with the bits of the
#       character code number in binary representation.  Only  the
#       shortest  possible  multibyte sequence which can represent
#       the code number of the character can be used.
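To see the table at work in the other direction, here is a minimal sketch
that encodes a code number into bytes by following its rows. It covers only
the first four rows (up to 0x001FFFFF), and CodeToUtf8 is a hypothetical
name used for illustration.

```perl
use strict ;
use warnings ;

# Hypothetical helper: encode a UCS code number as UTF-8 bytes,
# following the first four rows of the table above.
sub CodeToUtf8
{
  my $code = shift ;
  return (chr ($code)) if ($code < 0x80) ;            # 0xxxxxxx

  my ($bytes, $mark) = $code < 0x800   ? (2, 0xC0)    # 110xxxxx
                     : $code < 0x10000 ? (3, 0xE0)    # 1110xxxx
                     :                   (4, 0xF0) ;  # 11110xxx
  my @tail ;
  for (2 .. $bytes)
  {
    unshift (@tail, 0x80 | ($code & 0x3F)) ; # 10xxxxxx, lowest 6 bits
    $code >>= 6 ;
  }
  # what is left of $code fits in the payload bits of the lead byte
  return (pack ("C*", $mark | $code, @tail)) ;
}

# U+0109 (Esperanto c-circumflex) falls in the second row of the table
my $hex = join (" ", map { sprintf ("%02X", $_) } unpack ("C*", CodeToUtf8 (0x109))) ;
print "$hex\n" ;   # prints C4 89
```

Picking the row by the size of the code number is also what guarantees the
shortest possible sequence mentioned at the end of the table.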
By the way, I enjoyed your contribution about Ant Power.
If you have any questions or suggestions, you can reach me at
xxx@chello.nl
!spam: read xxx as epzachte
Cheers, Erik Zachte