Hi, Roy,

I will answer assuming you are talking about the wikireplicas/WMF production or the default MediaWiki installation; sorry if I am mistaken.

MediaWiki at WMF primarily uses binary text fields for historical and functional reasons, and in fact, as far as I understand, MediaWiki plans to drop support for MySQL's 3-byte utf8. The underlying encoding is indeed UTF-8, though. It is not unthinkable that you will find values containing invalid UTF-8 sequences, so be ready to catch an exception, but those should be considered bugs in the underlying data and be reported so they can be fixed. Since the driver hands you bytes, it is very common when using Python with MediaWiki to run:
.decode(encoding='UTF-8', errors='strict')
on the values from these binary fields. Here is a reported bug of such an exception: https://phabricator.wikimedia.org/T108434 .
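
For your json.dumps() case, here is a minimal sketch of how I would turn the bytes into str. It assumes your database driver returns the varbinary column as bytes; the helper name varbinary_to_str and the sample row are made up for illustration and are not part of MediaWiki or any library:

```python
import json


def varbinary_to_str(raw):
    """Decode a varbinary value (bytes) as strict UTF-8; pass NULLs through."""
    if raw is None:
        return None
    # bytes() also accepts bytearray/memoryview, which some drivers return.
    return bytes(raw).decode(encoding='UTF-8', errors='strict')


# Hypothetical stand-in for a row fetched from the user table.
row = {'user_name': 'Jos\u00e9'.encode('UTF-8')}

try:
    name = varbinary_to_str(row['user_name'])
except UnicodeDecodeError:
    # Invalid UTF-8 here means bad underlying data: worth reporting.
    name = None

payload = json.dumps({'user_name': name}, ensure_ascii=False)
```

With errors='strict' a malformed value raises UnicodeDecodeError instead of being silently altered, which is what you want if the goal is to notice and report bad data.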

As an addendum, note that we recommend not using binary fields for your own personal databases; use utf8mb4 (full 4-byte UTF-8) instead, unless you want a carbon copy of the MediaWiki fields in the original format.

On Fri, Dec 13, 2019 at 5:05 AM Roy Smith <roy@panix.com> wrote:
Fields such as user.user_name, are varbinary.  I'm writing this field as part of a json object in python; json.dumps() requires a string, not bytes, so I need to know the encoding.  I assume these are all utf-8?


--
Jaime Crespo
<http://wikimedia.org>