Hi Alfio,
I looked at your code. Nice job.
At first glance it may seem we did almost the same work, but the
overlap is minimal. My Perl script addresses a lot of issues that
are only relevant in a Palm/Pocket PC/TomeRaider environment.
Your version has quite a lot of code that is specific to a static HTML
version.
Still there are some areas where we can be of help to each other.
You mentioned Unicode support as an open issue. Coincidentally, I have
been looking into this for the past few days while preparing a
TomeRaider version of the Esperanto Wikipedia, which would be
unreadable without it.
You will also find the UTF-8 encoding scheme on which this is based below.
Here is some Perl code to translate UTF-8 multibyte sequences into
HTML numeric character references of the form &#nnnn;
# unicode -> html character codes &#nnnn;
$entry =~ s/([\x80-\xFF]+)/&UnicodeToHtml($1)/ge ;

# convert a run of UTF-8 bytes into a string of &#nnnn; codes
# (assumes well-formed input: every run starts with a lead byte)
sub UnicodeToHtml
{
  my $text = shift ;
  my $html = "" ;
  my ($c, $byte, $ord, $unicode, $bytes) ;

  for ($c = 0 ; $c < length ($text) ; $c++)
  {
    $byte = substr ($text,$c,1) ; # optimize with regexp ?
    $ord = ord ($byte) ;
    if ($ord < 128) # plain ascii character
    { $html .= $byte ; } # (will not occur in this script)
    else
    {
      # the lead byte determines the length of the sequence
      if ($ord < 224)    # 110xxxxx
      { $bytes = 2 ; }
      elsif ($ord < 240) # 1110xxxx
      { $bytes = 3 ; }
      elsif ($ord < 248) # 11110xxx
      { $bytes = 4 ; }
      elsif ($ord < 252) # 111110xx
      { $bytes = 5 ; }
      else               # 1111110x
      { $bytes = 6 ; }
      $unicode = substr ($text,$c,$bytes) ;
      $html .= &UnicodeToHtmlTag ($unicode) ;
      $c += $bytes - 1 ; # skip the continuation bytes just consumed
    }
  }
  return ($html) ;
}
# convert one UTF-8 byte sequence into a single &#nnnn; code
sub UnicodeToHtmlTag
{
  my $unicode = shift ;
  my $char = substr ($unicode,0,1) ;
  my $ord = ord ($char) ;
  my ($c, $value) ;

  if ($ord < 128) # plain ascii character
  { return ($unicode) ; } # (will not occur in this script)
  else
  {
    # strip the length prefix bits from the lead byte
    if ($ord >= 252)
    { $value = $ord - 252 ; }
    elsif ($ord >= 248)
    { $value = $ord - 248 ; }
    elsif ($ord >= 240)
    { $value = $ord - 240 ; }
    elsif ($ord >= 224)
    { $value = $ord - 224 ; }
    else
    { $value = $ord - 192 ; }
    # each continuation byte (10xxxxxx) adds 6 more bits
    for ($c = 1 ; $c < length ($unicode) ; $c++)
    { $value = $value * 64 + ord (substr ($unicode, $c,1)) - 128 ; }
    return ("\&\#" . $value . ";") ;
  }
}
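To make the arithmetic concrete, here is a small worked example for a
single Esperanto letter. It is only a sketch for checking the numbers by
hand; the variable names are made up for this example and it is not part
of the script itself.

```perl
# Worked example: c-circumflex (U+0109) is stored as the two bytes 0xC4 0x89.
my @bytes = (0xC4, 0x89) ;

# Lead byte 110xxxxx: subtract 192 to keep only the 5 payload bits.
my $value = $bytes[0] - 192 ;             # 196 - 192 = 4

# Continuation byte 10xxxxxx: shift left 6 bits, add its 6 payload bits.
$value = $value * 64 + $bytes[1] - 128 ;  # 4 * 64 + 9 = 265

print "&#$value;\n" ;                     # prints &#265;
```

265 is 0x109 in decimal, so a browser will show &#265; as the intended
letter.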
Found this somewhere on the web:
#UTF-8 works as follows:
#ENCODING
# The following byte sequences are used to represent a
# character. The sequence to be used depends on the UCS code
# number of the character:
# 0x00000000 - 0x0000007F:
# 0xxxxxxx
#
# 0x00000080 - 0x000007FF:
# 110xxxxx 10xxxxxx
#
# 0x00000800 - 0x0000FFFF:
# 1110xxxx 10xxxxxx 10xxxxxx
#
# 0x00010000 - 0x001FFFFF:
# 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
#
# 0x00200000 - 0x03FFFFFF:
# 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
#
# 0x04000000 - 0x7FFFFFFF:
# 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
#
# The xxx bit positions are filled with the bits of the
# character code number in binary representation. Only the
# shortest possible multibyte sequence which can represent
# the code number of the character can be used.
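In case it is useful for testing, the table can also be read in the other
direction. The following is a rough sketch of an encoder that follows the
ranges listed above; the name CodeToUtf8 is hypothetical and this routine
is not part of my script. It assumes the code number is between 0 and
0x7FFFFFFF and does no error checking.

```perl
# Hypothetical helper: code number -> UTF-8 byte string, per the table above.
sub CodeToUtf8
{
  my $code = shift ;
  return (chr ($code)) if ($code < 0x80) ;  # 0xxxxxxx: plain ascii

  # pick the shortest sequence that can hold the code number
  my ($bytes, $lead) ;
  if    ($code < 0x800)     { $bytes = 2 ; $lead = 0xC0 ; } # 110xxxxx
  elsif ($code < 0x10000)   { $bytes = 3 ; $lead = 0xE0 ; } # 1110xxxx
  elsif ($code < 0x200000)  { $bytes = 4 ; $lead = 0xF0 ; } # 11110xxx
  elsif ($code < 0x4000000) { $bytes = 5 ; $lead = 0xF8 ; } # 111110xx
  else                      { $bytes = 6 ; $lead = 0xFC ; } # 1111110x

  # build the continuation bytes from the low 6 bits, last byte first
  my $tail = "" ;
  for (my $c = 1 ; $c < $bytes ; $c++)
  {
    $tail = chr (0x80 | ($code & 0x3F)) . $tail ;
    $code >>= 6 ;
  }

  # whatever bits remain go into the lead byte
  return (chr ($lead | $code) . $tail) ;
}
```

Feeding the result back through a decoder should return the original
code number, which makes a simple round-trip test.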
By the way, I enjoyed your contribution about Ant Power.
If you have any questions or suggestions, you can reach me at
xxx(a)chello.nl
!spam: read xxx as epzachte
Cheers, Erik Zachte