TL;DR: Participate on T589 and help decide what the upcoming schema change should entail, and how we'll migrate existing data.
Hey all,
A couple of weeks ago we dedicated an IRC office hour to https://phabricator.wikimedia.org/T589 (RFC: image and oldimage tables).
Updated draft at: https://www.mediawiki.org/wiki/Requests_for_comment/image_and_oldimage_table...
We clarified the scope and purpose of this particular RFC. Other issues are still important but considered orthogonal, to be dealt with in parallel (or at a later time).
Revised problem statement:
1. File revisions should have unique identifiers, better than "current file title + upload timestamp" (which is subject to race conditions, hard to index/query, etc.).
2. Uploading new file revisions must not involve rows moving across tables, or rows being replaced.
Participants agreed with the revised problem statement. It makes sense not to merely add primary keys to the existing tables ("Proposal 1" on the RFC draft), as that wouldn't adequately solve Problem 2.
The second proposal was to separate information about image revisions from the image entity itself, similar to the page/revision tables. This was generally accepted as a good idea, but details are still to be determined.
The general idea is that all revision-specific information (except for a pointer to the current revision) would no longer live in the 'image' table. Instead, information about all revisions (both current and past) would live in the same table, instead of being moved from one table to another when a revision is no longer current.
Details at: https://www.mediawiki.org/wiki/Requests_for_comment/image_and_oldimage_table...
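To make the shape of that proposal concrete, here is a rough sketch of what the split could look like. I'm borrowing the file/filerevision names from the migration section below; the fr_* columns and exact types are made up for illustration, not what the RFC specifies:

    -- Rough sketch only; table and column names are illustrative.
    CREATE TABLE file (
      file_id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      file_name    VARBINARY(255) NOT NULL,          -- current title
      file_latest  INT UNSIGNED NOT NULL DEFAULT 0   -- points at filerevision.fr_id
    );

    CREATE TABLE filerevision (
      fr_id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      fr_file      INT UNSIGNED NOT NULL,            -- points at file.file_id
      fr_timestamp BINARY(14) NOT NULL,
      fr_size      INT UNSIGNED NOT NULL,
      fr_sha1      VARBINARY(32) NOT NULL,
      fr_metadata  MEDIUMBLOB
      -- plus the remaining img_*/oi_* fields, which all become revision fields
    );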
Some open questions I'd like to see discussed on Phabricator (or here on wikitech):
1. Which fields do we keep in the 'image' table (img_id, img_name, img_latest; anything else)?
All fields currently queried from both tables will probably stay only in the image revision table. But there are a few fields we intentionally want to query only for current versions, for example 'img_sha1'. For duplicate detection, we only need to consider the latest revisions. Doing this by keeping img_sha1 means uploading a new revision will involve updating two fields instead of one (img_latest and img_sha1). This isn't unprecedented: we do the same for page (WikiPage::updateRevisionOn updates page_latest, page_touched, page_is_redirect and page_len).
Are there other fields we need to keep besides img_sha1? Or should we solve the img_sha1 use case in a different manner?
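For illustration, if we keep a denormalised sha1 on the current-version table (call it file_sha1, mirroring img_sha1; the name is hypothetical, following the sketch above), the duplicate check and the upload path could look roughly like this:

    -- Duplicate detection only needs to look at latest revisions,
    -- so one indexed lookup on the current-version table suffices:
    SELECT file_name FROM file WHERE file_sha1 = @sha1;

    -- Uploading a new revision then updates two fields instead of one,
    -- much like WikiPage::updateRevisionOn does for page_latest etc.:
    UPDATE file
      SET file_latest = @new_fr_id,
          file_sha1   = @new_sha1
      WHERE file_id = @file_id;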
2. img_metadata
This field is a blob of serialised PHP (typically representing the Exif data of an image).
Tim (correct me if I got it wrong) mentioned we could potentially make migration easier by storing img_metadata in a separate table and changing the img_metadata field (in the image revision table) into a pointer to a primary key in that table.
This could potentially be done separately later, but if it helps migration, we should consider doing it now.
How will this interact with file deletion? Will it be difficult to garbage-collect this? Do we need to? (We don't seem to do it for the 'text' table / external store; is it worth moving this to an external store?)
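If I've understood Tim's suggestion right, the metadata split might look something like this (table and column names invented for illustration only):

    -- Hypothetical side table holding the serialised-PHP blobs:
    CREATE TABLE filemetadata (
      fm_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      fm_blob MEDIUMBLOB NOT NULL   -- serialised PHP, typically Exif data
    );

    -- In the revision table, fr_metadata would then shrink from a blob
    -- to a pointer:
    --   fr_metadata INT UNSIGNED NULL   -- references filemetadata.fm_id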
3. Migration
If we rename both tables (image/oldimage -> file/filerevision), we'd be able to run the migration in the background without interfering with the live site, and without requiring a long read-only period and/or developing duplicate code paths and additional complexity.
Is there a way we can do the migration without creating two new tables? Using the oldimage table as the import destination for current rows isn't straightforward, as existing scan queries would need to skip the current rows somehow while in the midst of this migration. Seems possible, but is it worth the complexity? (We'd need extra code that knows about that migration field, and how long do we keep that code? It also complicates migration for third parties using update.php.)
Is creating the new tables separately viable at the scale of Wikimedia Commons (dropping the old ones once finished)? Is this a concern from a DBA perspective with regard to storage space? (We'd temporarily need about twice the space for these tables.) So far I understood that it wouldn't be a problem per se, but that there are also other options we can explore for Wikimedia. For example, we could use a separate set of slaves and alter those while depooled (essentially using an entirely separate set of DB slaves instead of a separate table within each slave).
Do we create the new table(s) separately and switch over once it's caught up? This would require doing multiple passes as we depool slaves one by one (we've done that before at Wikimedia). Switch-over could be done by migrating before the software upgrade, with a very short read-only period after the last pass is finished. It wouldn't require maintaining multiple code paths, which is attractive.
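For the "new tables plus backfill" variants above, the background migration could in principle be a batched copy along these lines. This is a very rough sketch under the placeholder schema from earlier in this mail; it ignores deleted files, archive names, and replication-lag handling:

    -- 1) One 'file' row per title (file_latest stays 0 until the end):
    INSERT INTO file (file_name)
      SELECT img_name FROM image;

    -- 2) Backfill revisions in batches, current ones from 'image'
    --    (and old ones from 'oldimage' in the same fashion):
    INSERT INTO filerevision (fr_file, fr_timestamp, fr_size, fr_sha1, fr_metadata)
      SELECT file_id, img_timestamp, img_size, img_sha1, img_metadata
      FROM image JOIN file ON file_name = img_name
      WHERE img_name > @last_batch_end
      ORDER BY img_name
      LIMIT 1000;

    -- 3) Point file_latest at the newest revision per file, then switch
    --    over during the short read-only window:
    UPDATE file
      SET file_latest = (
        SELECT fr_id FROM filerevision
        WHERE fr_file = file_id
        ORDER BY fr_timestamp DESC
        LIMIT 1
      );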
Other ideas?
-- Timo
MediaWiki migration could be done in any number of ways; the easiest path for a live migration is to keep the image table: add the logic to ignore older revisions in the same table, add the oldimage entries (which are very few in comparison), then create the non-revision version, migrate the bulk of the code functionality, and finally drop the unnecessary columns. That was discussed at the time, although I am not sure it was all recorded. Roughly, in SQL terms, that in-place path might boil down to something like the sketch below (the img_is_current flag and its name are just for illustration, not a worked-out schema):

    -- Widen 'image' so it can also hold old revisions, instead of
    -- creating new tables:
    ALTER TABLE image
      ADD COLUMN img_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
      ADD COLUMN img_is_current TINYINT(1) NOT NULL DEFAULT 1,
      ADD UNIQUE KEY img_id (img_id);
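    -- Existing scan queries gain a filter so they keep seeing only
    -- current versions while the oldimage rows are imported:
    SELECT img_name FROM image WHERE img_sha1 = @sha1 AND img_is_current = 1;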
Duplicating Commons' infrastructure is not viable with the hardware we have now.
Modifying the image table is not "free", especially for Commons, but the image/oldimage pattern is already causing issues for our editors.