TL;DR: Participate on T589 and help decide what the upcoming schema change should entail, and how we'll migrate existing data.
Hey all,
A couple of weeks ago we dedicated an IRC office hour to https://phabricator.wikimedia.org/T589 (RFC: image and oldimage tables).
Updated draft at: https://www.mediawiki.org/wiki/Requests_for_comment/image_and_oldimage_table...
We clarified the scope and purpose of this particular RFC. Other issues are still important but considered orthogonal, to be dealt with in parallel (or at a later time).
Revised problem statement:
1. File revisions should have unique identifiers, better than "current file title + upload timestamp" (which is subject to race conditions, hard to index/query, etc.).
2. Uploading new file revisions must not involve rows moving across tables, or rows being replaced.
Participants agreed with the revised problem statement. It makes sense not to merely add primary keys to the existing tables ("Proposal 1" on the RFC draft), as that wouldn't adequately solve Problem 2.
The second proposal was to separate information about image revisions from the image entity itself, similar to the page/revision tables. This was generally accepted as a good idea, but details are still to be determined.
The general idea is that all revision-specific information (except for a pointer to the current revision) would no longer live in the 'image' table. Instead, information about all revisions (both current and past) would live in the same table, instead of being moved from one table to another when a revision is no longer current.
Details at: https://www.mediawiki.org/wiki/Requests_for_comment/image_and_oldimage_table...
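To make the shape of that proposal concrete, here is a rough sketch of what the split could look like. I'm borrowing the file/filerevision names from the migration section below; the fr_* columns and exact types are made up for illustration, not what the RFC specifies:

    -- Rough sketch only; table and column names are illustrative.
    CREATE TABLE file (
      file_id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      file_name    VARBINARY(255) NOT NULL,          -- current title
      file_latest  INT UNSIGNED NOT NULL DEFAULT 0   -- points at filerevision.fr_id
    );

    CREATE TABLE filerevision (
      fr_id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      fr_file      INT UNSIGNED NOT NULL,            -- points at file.file_id
      fr_timestamp BINARY(14) NOT NULL,
      fr_size      INT UNSIGNED NOT NULL,
      fr_sha1      VARBINARY(32) NOT NULL,
      fr_metadata  MEDIUMBLOB
      -- plus the remaining img_*/oi_* fields, which all become revision fields
    );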
Some open questions I'd like to see discussed on Phabricator (or here on wikitech):
1. Which fields do we keep in the 'image' table (img_id, img_name, img_latest; anything else)?
All fields currently queried from both tables will probably stay only in the image revision table. But there are a few fields we intentionally want to query only for current versions, for example 'img_sha1'. For duplicate detection, we only need to consider the latest revisions. Doing this by keeping img_sha1 means uploading a new revision will involve updating two fields instead of one (img_latest and img_sha1). This isn't unprecedented: we do the same for page (WikiPage::updateRevisionOn updates page_latest, page_touched, page_is_redirect and page_len).
Are there other fields we need to keep besides img_sha1? Or should we solve the img_sha1 use case in a different manner?
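For illustration, if we keep a denormalised sha1 on the current-version table (call it file_sha1, mirroring img_sha1; the name is hypothetical, following the sketch above), the duplicate check and the upload path could look roughly like this:

    -- Duplicate detection only needs to look at latest revisions,
    -- so one indexed lookup on the current-version table suffices:
    SELECT file_name FROM file WHERE file_sha1 = @sha1;

    -- Uploading a new revision then updates two fields instead of one,
    -- much like WikiPage::updateRevisionOn does for page_latest etc.:
    UPDATE file
      SET file_latest = @new_fr_id,
          file_sha1   = @new_sha1
      WHERE file_id = @file_id;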
2. img_metadata
This field is a blob of serialised PHP (typically representing the Exif data of an image).
Tim (correct me if I got it wrong) mentioned we could potentially make migration easier by storing img_metadata in a separate table and changing the img_metadata field (in the image revision table) into a pointer to a primary key in that table.
This could potentially be done separately later, but if it helps migration, we should consider doing it now.
How will this interact with file deletion? Will it be difficult to garbage-collect this? Do we need to? (We don't seem to do it for the 'text' table / external store; is it worth moving this to an external store?)
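If I've understood Tim's suggestion right, the metadata split might look something like this (table and column names invented for illustration only):

    -- Hypothetical side table holding the serialised-PHP blobs:
    CREATE TABLE filemetadata (
      fm_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      fm_blob MEDIUMBLOB NOT NULL   -- serialised PHP, typically Exif data
    );

    -- In the revision table, fr_metadata would then shrink from a blob
    -- to a pointer:
    --   fr_metadata INT UNSIGNED NULL   -- references filemetadata.fm_id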
3. Migration
If we rename both tables (image/oldimage -> file/filerevision), we'd be able to run the migration in the background without interfering with the live site, and without requiring a long read-only period and/or developing duplicate code paths and additional complexity.
Is there a way we can do the migration without creating two new tables? Using the oldimage table as the import destination for current rows isn't straightforward, as existing scan queries would need to skip the current rows somehow while in the midst of this migration. Seems possible, but is it worth the complexity? (We'd need extra code that knows about that migration field, and how long do we keep that code? It also complicates migration for third parties using update.php.)
Is creating the new tables separately viable at the scale of Wikimedia Commons (dropping the old ones once finished)? Is this a concern from a DBA perspective with regard to storage space? (We'd temporarily need about twice the space for these tables.) So far I understood that it wouldn't be a problem per se, but that there are also other options we can explore for Wikimedia. For example, we could use a separate set of slaves and alter those while depooled (essentially using an entirely separate set of DB slaves instead of a separate table within each slave).
Do we create the new table(s) separately and switch over once it's caught up? This would require doing multiple passes as we depool slaves one by one (we've done that before at Wikimedia). Switch-over could be done by migrating before the software upgrade, with a very short read-only period after the last pass is finished. It wouldn't require maintaining multiple code paths, which is attractive.
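For the "new tables plus backfill" variants above, the background migration could in principle be a batched copy along these lines. This is a very rough sketch under the placeholder schema from earlier in this mail; it ignores deleted files, archive names, and replication-lag handling:

    -- 1) One 'file' row per title (file_latest stays 0 until the end):
    INSERT INTO file (file_name)
      SELECT img_name FROM image;

    -- 2) Backfill revisions in batches, current ones from 'image'
    --    (and old ones from 'oldimage' in the same fashion):
    INSERT INTO filerevision (fr_file, fr_timestamp, fr_size, fr_sha1, fr_metadata)
      SELECT file_id, img_timestamp, img_size, img_sha1, img_metadata
      FROM image JOIN file ON file_name = img_name
      WHERE img_name > @last_batch_end
      ORDER BY img_name
      LIMIT 1000;

    -- 3) Point file_latest at the newest revision per file, then switch
    --    over during the short read-only window:
    UPDATE file
      SET file_latest = (
        SELECT fr_id FROM filerevision
        WHERE fr_file = file_id
        ORDER BY fr_timestamp DESC
        LIMIT 1
      );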
Other ideas?
-- Timo
MediaWiki migration could be done in any number of ways; the easiest path for a live migration is to keep the image table: add the logic to ignore older revisions in the same table, add the oldimage entries (which are very few in comparison), then create the non-revision version, migrate the bulk of the code functionality, and finally drop the unnecessary columns. That was discussed at the time, although I am not sure it was all recorded. Roughly, in SQL terms, that in-place path might boil down to something like the sketch below (the img_is_current flag and its name are just for illustration, not a worked-out schema):

    -- Widen 'image' so it can also hold old revisions, instead of
    -- creating new tables:
    ALTER TABLE image
      ADD COLUMN img_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
      ADD COLUMN img_is_current TINYINT(1) NOT NULL DEFAULT 1,
      ADD UNIQUE KEY img_id (img_id);
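    -- Existing scan queries gain a filter so they keep seeing only
    -- current versions while the oldimage rows are imported:
    SELECT img_name FROM image WHERE img_sha1 = @sha1 AND img_is_current = 1;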
Duplicating Commons' infrastructure is not viable with the hardware we have now.
Modifying the image table is not "free", especially for Commons, but the image/oldimage pattern is already causing issues for our editors.