Hi Alfio,
I looked at your code. Nice job.
Superficially it may seem we did almost the same job, but the overlap is minimal. My Perl script addresses a lot of issues that are only relevant in a Palm/Pocket PC/TomeRaider environment.
Your version has quite a bit of code that is specific to a static HTML version.
Still, there are some areas where we can be of help to each other. You mentioned Unicode support as an open issue. Coincidentally, I have been looking into this over the past few days while preparing a TomeRaider version of the Esperanto Wikipedia, which would be unreadable without it.
You will also find the UTF-8 coding scheme on which this is based below.
Here is some Perl code that translates UTF-8 multi-byte sequences into HTML character codes of the form &#nnn;
# unicode -> html character codes &#nnnn;
$entry =~ s/([\x80-\xFF]+)/&UnicodeToHtml($1)/ge ;
sub UnicodeToHtml
{
  my $text = shift ;
  my $html = "" ;
  my ($c, $byte, $ord, $unicode, $bytes) ;

  for ($c = 0 ; $c < length ($text) ; $c++)
  {
    $byte = substr ($text,$c,1) ; # optimize with regexp ?
    $ord  = ord ($byte) ;

    if ($ord < 128) # plain ascii character
    { $html .= $byte ; } # (will not occur in this script)
    else
    {
      # the lead byte tells how many bytes the sequence occupies
      if    ($ord < 224) { $bytes = 2 ; }
      elsif ($ord < 240) { $bytes = 3 ; }
      elsif ($ord < 248) { $bytes = 4 ; }
      elsif ($ord < 252) { $bytes = 5 ; }
      else               { $bytes = 6 ; }

      $unicode = substr ($text,$c,$bytes) ;
      $html .= &UnicodeToHtmlTag ($unicode) ;
      $c += $bytes - 1 ;
    }
  }
  return ($html) ;
}

sub UnicodeToHtmlTag
{
  my $unicode = shift ;
  my $ord = ord (substr ($unicode,0,1)) ;
  my ($c, $value) ;

  if ($ord < 128) # plain ascii character
  { return ($unicode) ; } # (will not occur in this script)
  else
  {
    # strip the length marker bits from the lead byte
    if    ($ord >= 252) { $value = $ord - 252 ; }
    elsif ($ord >= 248) { $value = $ord - 248 ; }
    elsif ($ord >= 240) { $value = $ord - 240 ; }
    elsif ($ord >= 224) { $value = $ord - 224 ; }
    else                { $value = $ord - 192 ; }

    # each continuation byte (10xxxxxx) adds six more bits
    for ($c = 1 ; $c < length ($unicode) ; $c++)
    { $value = $value * 64 + ord (substr ($unicode, $c,1)) - 128 ; }

    return ("&#" . $value . ";") ;
  }
}
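If your Perl ships with the core Encode module (5.8 and later, which is an assumption about your setup), the same conversion can probably be done with much less code. This is only a sketch of an alternative, not what my script does; the name utf8_to_html_refs is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, not part of the script above: turn a string of
# UTF-8 bytes into ASCII text with &#nnn; codes for everything else.
sub utf8_to_html_refs {
    my ($bytes) = @_;

    # Let Encode interpret the multi-byte sequences as characters.
    my $chars = decode('UTF-8', $bytes);

    # Replace every non-ASCII character by its decimal character code.
    $chars =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;

    return $chars;
}

# Esperanto "eĉo": the letter ĉ is U+0109, stored as the bytes C4 89.
print utf8_to_html_refs("e\xC4\x89o"), "\n";   # prints e&#265;o
```

The hand-written version above has the advantage that it runs on older Perls without Encode.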
Found this somewhere on the web:
#UTF-8 works as follows:
#ENCODING
# The following byte sequences are used to represent a character.
# The sequence to be used depends on the UCS code number of the
# character:
#
# 0x00000000 - 0x0000007F:
#   0xxxxxxx
#
# 0x00000080 - 0x000007FF:
#   110xxxxx 10xxxxxx
#
# 0x00000800 - 0x0000FFFF:
#   1110xxxx 10xxxxxx 10xxxxxx
#
# 0x00010000 - 0x001FFFFF:
#   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
#
# 0x00200000 - 0x03FFFFFF:
#   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
#
# 0x04000000 - 0x7FFFFFFF:
#   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
#
# The xxx bit positions are filled with the bits of the
# character code number in binary representation. Only the
# shortest possible multibyte sequence which can represent
# the code number of the character can be used.
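To make the table above concrete, here is a small sketch that goes the other way: it takes a character code number and builds the byte sequence by hand, following the table row by row. The name encode_utf8_sketch and the layout of @forms are my own choices for illustration, and it covers all six forms only because the table does.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one code number into UTF-8 bytes,
# following the table above.
sub encode_utf8_sketch {
    my ($cp) = @_;

    # First row: plain ASCII is stored as a single byte 0xxxxxxx.
    return chr($cp) if $cp < 0x80;

    # Remaining rows: [exclusive upper limit, lead byte marker bits].
    my @forms = (
        [0x800,      0xC0],   # 110xxxxx + 1 continuation byte
        [0x10000,    0xE0],   # 1110xxxx + 2
        [0x200000,   0xF0],   # 11110xxx + 3
        [0x4000000,  0xF8],   # 111110xx + 4
        [0x80000000, 0xFC],   # 1111110x + 5
    );

    my $extra = 1;            # number of continuation bytes for this row
    for my $form (@forms) {
        my ($limit, $marker) = @$form;
        if ($cp < $limit) {
            my @bytes;
            # Peel off six bits at a time, lowest bits first.
            for (1 .. $extra) {
                unshift @bytes, 0x80 | ($cp & 0x3F);
                $cp >>= 6;
            }
            # Whatever is left fits in the lead byte.
            unshift @bytes, $marker | $cp;
            return pack('C*', @bytes);
        }
        $extra++;
    }
    die "code number out of range\n";
}

# Esperanto ĉ is U+0109, which falls in the second row of the table.
print join(' ', map { sprintf '%02X', ord } split //, encode_utf8_sketch(0x109)), "\n";   # prints C4 89
```

Because the rows are tried from shortest to longest, the sketch always picks the shortest possible sequence, as the last sentence of the description requires.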
By the way I enjoyed your contribution about Ant Power.
If you have any questions or suggestions you can reach me at xxx@chello.nl !spam: read xxx as epzachte
Cheers, Erik Zachte
wikitech-l@lists.wikimedia.org