Fields such as user.user_name are varbinary. I'm writing this field as part of a JSON object in Python; json.dumps() requires a string, not bytes, so I need to know the encoding. I assume these are all UTF-8?
Hi, Roy,
I will answer assuming you are talking about wikireplicas/WMF production, or the default Mediawiki installation, sorry if I am mistaken.
Mediawiki at WMF uses primarily binary text fields for historical and functional reasons, and in fact, as far as I understand, Mediawiki is planned to drop 3-byte utf8 MySQL support. But indeed it uses UTF-8 as the underlying encoding. It is not unthinkable that you could find cases with invalid UTF-8 characters, so be ready to capture an exception, but those should be considered bugs in the underlying data and be reported for fix. It is very common when using Python with mediawiki to run:
.decode(encoding='UTF-8', errors='strict')
when dealing with these binary fields. Here is a reported bug of an exception: https://phabricator.wikimedia.org/T108434 .
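To make the pattern above concrete, here is a minimal sketch, assuming each row arrives as a dict whose varbinary columns are Python bytes. The helper names decode_field and row_to_json are hypothetical, made up for illustration, and not part of any library. It decodes strictly, so invalid UTF-8 surfaces as an error you can report rather than being silently replaced.

```python
import json


def decode_field(raw, encoding="utf-8"):
    """Decode one varbinary value to str (hypothetical helper).

    None (SQL NULL) and values that are already str pass through unchanged.
    """
    if raw is None or isinstance(raw, str):
        return raw
    # errors="strict" raises UnicodeDecodeError on invalid byte sequences.
    return bytes(raw).decode(encoding=encoding, errors="strict")


def row_to_json(row):
    """Serialize a row dict to JSON, decoding any bytes values first."""
    decoded = {}
    for key, value in row.items():
        if isinstance(value, (bytes, bytearray)):
            try:
                decoded[key] = decode_field(value)
            except UnicodeDecodeError as exc:
                # Invalid UTF-8 is a data bug worth reporting; include the
                # column name and raw bytes so the bad row can be found.
                raise ValueError(
                    f"column {key!r} holds invalid UTF-8: {value!r}"
                ) from exc
        else:
            decoded[key] = value
    # ensure_ascii=False keeps non-ASCII user names readable in the output.
    return json.dumps(decoded, ensure_ascii=False)
```

Calling row_to_json({"user_id": 1, "user_name": b"Roy"}) yields a JSON string with user_name as text. Whether to re-raise, skip the row, or fall back to errors='replace' on bad data is a design choice that depends on what the output is used for.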
As an addendum, note that we recommend against using binary fields in your own personal databases; use utf8mb4 (UTF-8) instead, unless you want a carbon copy of the MediaWiki fields in their original format.
On Fri, Dec 13, 2019 at 5:05 AM Roy Smith roy@panix.com wrote:
Fields such as user.user_name are varbinary. I'm writing this field as part of a JSON object in Python; json.dumps() requires a string, not bytes, so I need to know the encoding. I assume these are all UTF-8?
On Dec 13, 2019, at 5:04 AM, Jaime Crespo jcrespo@wikimedia.org wrote:
Hi, Roy,
I will answer assuming you are talking about wikireplicas/WMF production, or the default Mediawiki installation
Yes. Specifically, the instance you get when you run "sql enwiki" from tools-sgebastion-08.
Mediawiki is planned to drop 3-byte utf8 MySQL support.
Hmmm, as far as I can tell, the change from utf8mb3 to utf8mb4 will be invisible to me, but good to know.
It is not unthinkable that you could find cases with invalid UTF-8 characters, so be ready to capture an exception
Good idea, I'll add the exception handling. Thanks.