Hello,

If you don't use the metadata fields of the file tables (the image table, for example) in the replicas, you can ignore this email.

The "image" table in Wikimedia Commons is extremely large (more than 380GB compressed) and has caused multiple major issues, including a recent incident. A closer inspection revealed that more than 80% of this table is metadata of PDF files, around 10% is metadata of DjVu files, and the remaining 10% is everything else. This clearly needs fixing.

Tim Starling has done the work on this, and we are slowly rolling out two major changes:

First, the format of metadata in the database (for example, the img_metadata field in the image table) will change for all files. It used to be PHP serialization, but it is being changed to JSON. You can see a before-and-after example at https://phabricator.wikimedia.org/T275268#7178983 Keep in mind that for some time this will be in a hybrid mode: some files will have their metadata in JSON and others in PHP serialization. If you parse this value, you need to support both formats for a while.
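
If you do parse this value in a tool, a minimal sketch of handling the hybrid period in Python could look like the following. It rests on assumptions that are mine and not part of the change itself: that JSON values start with "{" or "[", that PHP-serialized arrays start with "a:", and that you supply your own PHP unserializer (the function name parse_img_metadata and the php_unserialize parameter are hypothetical).

```python
import json


def parse_img_metadata(raw, php_unserialize=None):
    """Guess the format of an img_metadata value and decode it.

    Returns a (format, value) tuple where format is "json", "php" or
    "unknown". php_unserialize is a hypothetical caller-supplied callable
    that decodes PHP serialization; without it, PHP values are only
    detected and value is None.
    """
    # Replica columns often come back as bytes, so normalise to text first.
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8", errors="replace")
    text = raw.strip()

    # Assumption: the new format is a JSON object or array.
    if text.startswith("{") or text.startswith("["):
        try:
            return "json", json.loads(text)
        except ValueError:
            pass  # Not valid JSON after all; fall through to other checks.

    # Assumption: the old format is a PHP-serialized array ("a:<count>:{...}").
    if text.startswith("a:"):
        if php_unserialize is None:
            return "php", None
        return "php", php_unserialize(text)

    # Empty strings, placeholders and anything else are left to the caller.
    return "unknown", None
```

Checking for JSON first means rows that have already been migrated never reach the PHP path, so the PHP branch can simply be deleted once the rollout is finished.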

Second, some parts of the metadata for PDF files, and later DjVu files, won't be accessible in Wikimedia Cloud anymore, because this data will be moved to External Storage (ES) and ES is not accessible from the outside. It's mostly the OCR text of PDF files. You can still access it using the API (action=query&prop=imageinfo).
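
As a rough illustration of the API route, here is a short Python sketch using only the standard library. Only action=query and prop=imageinfo come from this email; the iiprop=metadata and format=json parameters, the example title, the User-Agent string and the function names are my assumptions, so please adapt them to your tool.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://commons.wikimedia.org/w/api.php"


def build_imageinfo_url(title):
    """Build an imageinfo query URL for one file title."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "metadata",  # assumption: ask only for the metadata property
        "titles": title,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def fetch_imageinfo(title):
    """Fetch and decode the API response for one file title."""
    request = Request(
        build_imageinfo_url(title),
        # Placeholder User-Agent; replace it with one that identifies your tool.
        headers={"User-Agent": "metadata-example/0.1 (hypothetical)"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling fetch_imageinfo("File:Example.pdf") (a hypothetical title) returns the decoded JSON response as a Python dict.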

Nothing will change for outside users: the API will return the same results and the user interface will show the same thing. But this will make all of Wikimedia Commons more reliable and faster to access (through indirect effects such as improved InnoDB buffer pool efficiency), reduce the time it takes to back up the database, enable us to make bigger changes to the image table and improve its schema, and much more.

I hope no one relies heavily on the img_metadata field in the cloud replicas, but if you do, please let me know and reach out for help.

You can keep track of the work at https://phabricator.wikimedia.org/T275268

Thank you for understanding and sorry for any inconvenience.

Amir (he/him)