Hi everyone,
For this week's office hour, we'd like to discuss [T589: schema change for page content language][1] Timo brought this up on the list a couple of weeks ago, and didn't get much of a response. (But thank you Jaime for responding!) We discussed this one fairly recently ([July 13][2]). This area of our system is still in need of simplification and optimization.
The discussion is scheduled for 2016-08-31 UTC: Time: Wednesday 21 UTC (2pm PDT, 23 CEST) Place: #wikimedia-office Phab event: [E266][3] [ArchCom/Status][4]
Rob
[1]: https://phabricator.wikimedia.org/T589 [3]: https://phabricator.wikimedia.org/E228 July 13 meeting [3]: https://phabricator.wikimedia.org/E266 Upcoming meeting [4]: https://www.mediawiki.org/wiki/Architecture_committee/Status ---------- Forwarded message ---------- From: Krinkle krinklemail@gmail.com Date: Wed, Aug 10, 2016 at 1:54 PM Subject: [Wikitech-l] Schema migration for 'image' and 'oldimage' tables To: Wikimedia developers wikitech-l@lists.wikimedia.org
TL;DR: Participate on T589 and help decide what the upcoming schema change should entail, and how we'll migrate existing data.
Hey all,
Couple weeks ago we dedicated an IRC office hour to https://phabricator.wikimedia.org/T589 (RFC: image and oldimage tables).
Updated draft at: https://www.mediawiki.org/wiki/Requests_for_comment/image_and_oldimage_table...
We clarified scope and purpose of this particular RFC. Other issues are still important but considered orthogonal, and to be dealt with parallelly (or at a later time).
Revised problem statement:
1. File revisions should have unique identifiers (better than "current file title + upload timestamp". (Subject to race conditions, hard to index/query, etc.) 2. Uploading new file revisions must not involve rows moving across tables, or rows being replaced.
Participants agreed with the revised problem statement, it makes sense not to merely add primary keys to the existing tables ("Proposal 1" on the RFC draft), as that wouldn't adequately solve the Problem 2.
The second proposal was to separate information about image revision from the image entity itself. Similar to the page/revisions tables. This was generally accepted as a good idea, but details are still to be determined.
The general idea is that all revision-specific information (except for a pointer to the current revision) would no longer live in the 'image' table. Instead, information about all (for both current and past revisions) would live in the same table (instead of being moved around from one table to another when it's no longer the current one).
Details at: https://www.mediawiki.org/wiki/Requests_for_comment/image_and_oldimage_table...
Some open questions I'd like to see discussed on Phabricator (or here on wikitech):
1. Which fields do we keep in the 'image' table (img_id, img_name, img_latest, anything else?).
All fields currently being queried from both tables, will probably only stay in the image revision table. But there are a few fields that we intentionally only want to query about current versions. For example 'img_sha1'. For duplicate-detection, we need to only consider the latest revisions. Doing this by keeping img_sha1 means uploading a new revision will involve updating two fields instead of one (img_latest and img_sha1). This isn't unprecedented as we do this for page as well (WikiPage::updateRevisionOn; page_latest, page_touched, page_is_redirect, page_len).
Are there other fields we need to keep besides img_sha1? Or should we can solve the img_sha1 use case in a different manner?
2. img_metadata
This field is a blob of serialised PHP (typically representing the Exif data of an image).
Tim (correct me if I got it wrong) mentioned we could potentially make migration easier by changing img_metadata to be stored in a separate table and change the img_metadata field (in the image revision table) to instead be a pointer to a primary key.
This could potentially be done separately later, but if it helps migration, we should consider doing it now.
How will this interact with file deletion? Will it be difficult to garbage collect this? Do we need to? (We don't seem to do it for the 'text' table / external store; is it worth moving this an external store?)
3. Migration
If we rename both tables (image/oldimage -> file/filerevision), we'd have the ability to run migration in the background without interfering with the live site, and without requiring a long read-only period and/or duplicate and additional code complexity to be developed.
Is there a way we can do the migration without creating two new tables? Using the oldimage table as import destination for current rows isn't straight forward as existing scan queries would need to skip the current rows somehow while in the midst of this migration. Seems possible, but is it worth the complexity? (We'd need extra code that knows about that migration field, and how long do we keep that code? Also complicates migration for third-parties using update.php).
Is creating the new tables separately viable for the scale of Wikimedia Commons? (and dropping the old ones once finished). Is this a concern from a DBA perspective with regards to storage space? (We'd temporarily need about twice the space for these tables). So far I understood that it wouldn't be a problem per se, but that there are also other options we can explore for Wikimedia. For example we could use a separate set of slaves and alter those while depooled (essentially using entirely separate set of db slaves instead of a separate table within each slave).
Do we create the new table(s) separately and switch over once it's caught up? This would require doing multiple passes as we depool slaves one by one (we've done that before at Wikimedia). Switch-over could be done by migrating before the software upgrade, with a very short read-only period after the last pass is finished. It wouldn't require maintaining multiple code paths, which is attractive.
Other ideas?
-- Timo _______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l