In the past, whenever I've imported the English Wikipedia metadata into PostgreSQL I've been forced to use bytea fields because PostgreSQL's UTF-8 support was limited to BMP characters. Recently PostgreSQL has been extended to fully support all valid UTF-8, including the 4-byte sequences used for characters outside the BMP.
I'd stopped using PG for my Wikipedia work because I'm trying to build a comparable level of MySQL skill to help me better work with the project, but I've found that MySQL's query engine is just too stupid for the analysis queries I run; I have to use PG if I don't want to grow old waiting for somewhat fancy queries (like stuff with subselects) to finish.
I've run into some cases in the Wikipedia metadata where there are invalid UTF-8 sequences (i.e. not a non-BMP issue; the bytes are just plain invalid). Many of these appear to be in the imagelinks table, where they describe images that don't exist. I handle these on import by feeding the text through iconv -c -f UTF8 -t UTF8, which cleans them up okay.
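For the curious, that cleanup step looks something like the following (the file and database names are just placeholders, not the actual ones I use):

  # -f and -t are both UTF-8, so no real conversion happens; -c simply drops
  # any byte sequences that iconv can't interpret as valid UTF-8
  iconv -c -f UTF8 -t UTF8 < imagelinks.sql | psql enwiki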
Since most of these are in places where I can just go fix them, I've been doing a little of that, and will probably eventually go around and get them all... I'm assuming that these were entered in before we were correctly filtering text, but I'm somewhat concerned that there may be some data entry paths which are not being filtered. Is this possible? If so, I'll create some test cases.
Gregory Maxwell wrote:
> Since most of these are in places where I can just go fix them, I've been doing a little of that, and will probably eventually go around and get them all... I'm assuming that these were entered in before we were correctly filtering text,
Yes, some old ones in images etc still.
> but I'm somewhat concerned that there may be some data entry paths which are not being filtered. Is this possible? If so, I'll create some test cases.
Unlikely but possible. Check WebRequest.php etc.
-- brion vibber (brion @ pobox.com)