In the past, whenever I've imported the English Wikipedia metadata into PostgreSQL I've been forced to use bytea fields because PostgreSQL's UTF-8 support was limited to BMP characters. Recently PostgreSQL has been extended to fully support all valid UTF-8, including the 4-byte sequences used for characters outside the BMP.
I'd stopped using PG for my Wikipedia work because I'm trying to build a comparable level of MySQL skill to help me better work with the project, but I've found that MySQL's query engine is just too stupid for the analysis queries I run; I have to use PG if I don't want to grow old waiting for somewhat fancy queries (like stuff with subselects) to finish.
I've run into some cases in the Wikipedia metadata where there are invalid UTF-8 sequences (i.e. not a non-BMP issue; the bytes are just plain invalid). Many of these appear to be in the imagelinks table, where they describe images that don't exist. I handle these on import by feeding the text through iconv -c -f UTF8 -t UTF8, which cleans them up okay.
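For the curious, that cleanup step looks something like the following (the file and database names are just placeholders, not the actual ones I use):

  # -f and -t are both UTF-8, so no real conversion happens; -c simply drops
  # any byte sequences that iconv can't interpret as valid UTF-8
  iconv -c -f UTF8 -t UTF8 < imagelinks.sql | psql enwiki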
Since most of these are in places where I can just go fix them, I've been doing a little of that, and will probably eventually go around and get them all... I'm assuming that these were entered in before we were correctly filtering text, but I'm somewhat concerned that there may be some data entry paths which are not being filtered. Is this possible? If so, I'll create some test cases.
Gregory Maxwell wrote:
> Since most of these are in places where I can just go fix them, I've been doing a little of that, and will probably eventually go around and get them all... I'm assuming that these were entered in before we were correctly filtering text,
Yes, some old ones in images etc still.
> but I'm somewhat concerned that there may be some data entry paths which are not being filtered. Is this possible? If so, I'll create some test cases.
Unlikely but possible. Check WebRequest.php etc.
-- brion vibber (brion @ pobox.com)