I've got an early draft of some notes https://www.mediawiki.org/wiki/User:Brion_VIBBER/Compacting_the_revision_table_round_2 for a restructuring of the revision table, to support the following:
* making the revision table itself smaller by breaking large things out
* reducing duplicate string storage for content model/format, username/IP address, and edit comments
* multi-content revisions ("MCR") - multiple Content blobs of different types on a page, revisioned consistently
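A minimal sketch of the duplicate-string idea from the list above, using Python's built-in sqlite3 module. The table and column names (`comment`, `comment_text`, `rev_comment_id`) are assumptions made for illustration and are not taken from the draft notes; the point is only that each distinct edit comment is stored once and revisions point at it by id.

```python
import sqlite3

def intern_comment(conn, text):
    """Return the id of an existing comment row, inserting one if needed."""
    row = conn.execute(
        "SELECT comment_id FROM comment WHERE comment_text = ?", (text,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute("INSERT INTO comment (comment_text) VALUES (?)", (text,))
    return cur.lastrowid

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the revision row keeps only a small integer reference.
conn.executescript("""
    CREATE TABLE comment (
        comment_id INTEGER PRIMARY KEY,
        comment_text TEXT NOT NULL UNIQUE
    );
    CREATE TABLE revision (
        rev_id INTEGER PRIMARY KEY,
        rev_comment_id INTEGER NOT NULL REFERENCES comment(comment_id)
    );
""")

# Four revisions, but only two distinct edit comments.
for summary in ["typo", "revert vandalism", "typo", "typo"]:
    conn.execute(
        "INSERT INTO revision (rev_comment_id) VALUES (?)",
        (intern_comment(conn, summary),),
    )

revision_count = conn.execute("SELECT COUNT(*) FROM revision").fetchone()[0]
comment_count = conn.execute("SELECT COUNT(*) FROM comment").fetchone()[0]
```

The same pattern would presumably apply to content model/format names and to username/IP strings, each with its own lookup table.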
There are also some ideas going around about using denormalized summary tables more aggressively, perhaps changing where the indexes used for specific uses live. For instance, a 'contribs' table with just the bits needed for the index lookups for user-contribs, then joined to the other tables.
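As a rough illustration of the 'contribs' idea, here is a sketch in Python with sqlite3. Every name in it (`contrib_user`, `contrib_timestamp`, `contrib_rev`, the index name, the sample rows) is hypothetical; it only shows a narrow table carrying the user-contribs index, joined back to the revision table for the remaining fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the index for user-contribs lives on the narrow table.
conn.executescript("""
    CREATE TABLE revision (
        rev_id INTEGER PRIMARY KEY,
        rev_page INTEGER NOT NULL,
        rev_len INTEGER NOT NULL
    );
    CREATE TABLE contribs (
        contrib_user INTEGER NOT NULL,
        contrib_timestamp TEXT NOT NULL,
        contrib_rev INTEGER NOT NULL REFERENCES revision(rev_id)
    );
    CREATE INDEX contribs_user_timestamp
        ON contribs (contrib_user, contrib_timestamp);
""")

# Made-up sample data: (rev_id, page, length, user, timestamp).
rows = [
    (1, 10, 500, 7, "20170213092800"),
    (2, 11, 120, 8, "20170214103800"),
    (3, 10, 640, 7, "20170215210600"),
]
for rev_id, page, length, user, ts in rows:
    conn.execute("INSERT INTO revision VALUES (?, ?, ?)", (rev_id, page, length))
    conn.execute("INSERT INTO contribs VALUES (?, ?, ?)", (user, ts, rev_id))

def user_contribs(conn, user_id, limit=50):
    """Look up a user's edits via the summary table, newest first."""
    return conn.execute(
        """
        SELECT r.rev_id, r.rev_page, c.contrib_timestamp
        FROM contribs AS c
        JOIN revision AS r ON r.rev_id = c.contrib_rev
        WHERE c.contrib_user = ?
        ORDER BY c.contrib_timestamp DESC
        LIMIT ?
        """,
        (user_id, limit),
    ).fetchall()

latest = user_contribs(conn, 7)
```

The trade-off is that every new revision needs a second insert to keep the summary table in step with the main one.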
Initial notes at https://www.mediawiki.org/wiki/User:Brion_VIBBER/Compacting_the_revision_tab... -- I'll be cleaning this up a bit more in response to feedback and concerns.
If we go through with this sort of change, we'll need to carefully consider the upgrade transition. We'll also need to make sure that all relevant queries are updated, and that folks using the databases indirectly (via tool labs, etc) are all able to cleanly handle the new fun stuff. Feedback will be crucial here. :)
Potentially we might split this into a couple transitions instead, or otherwise make major changes to the plan. Nothing's set in stone yet!
-- brion
Whoops I forgot to mention in the list post -- we're planning to talk about this topic in the public ArchCom IRC meeting this Wednesday (21:00 UTC / 2pm PDT).
Already getting good feedback on the page, am updating it, and looking forward to more.... Thanks all. :)
-- brion
aaaaand that'll be in #wikimedia-office on irc.freenode.net. :)
-- brion
Correction: 22:00 UTC / 2pm PST in #wikimedia-office. Sorry, I calculated with the wrong time by mistake!
-- brion
Great feedback everybody -- I'll make more updates and we'll circle back for another discussion in a week or two!
Meeting summary (full logs linked from there): https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.201...
-- brion
On Wed, Feb 15, 2017 at 3:06 PM, Brion Vibber bvibber@wikimedia.org wrote:
Great feedback everybody -- I'll make more updates and we'll circle back for another discussion in a week or two!
Meeting summary (full logs linked from there): https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-02-15-22.01.html
We're going to have another checkin during ArchCom IRC meeting time this Wednesday, 22:00 UTC / 2pm PST in #wikimedia-office
Documents will be updated shortly reflecting the previous discussion & ongoing tweaks.
Open questions include:
* should we go straight to the MCR-ready schema or do this in two steps, one to break up tables & prep, and another for the MCR content model?
* final model for updating archive & text
-- brion
Re: https://www.mediawiki.org/wiki/?curid=661038
The implementation path isn't clear to me. For a "regular" MediaWiki installation, will making these changes be a matter of simply updating MediaWiki's application code and running maintenance/update.php?
For Wikimedia wikis, as far as I know update.php is never run. Are you planning to write separate maintenance scripts for this?
Regarding scope, this is a lot of changes. How are all of these changes intended to be divided? Are we able to move forward with some changes (e.g., adding a comment table) without moving forward with other changes (e.g., adding a user_entry table)? Some parts of this proposal seem to be well-received and popular (yay). Other parts, particularly dealing with users, seem to be hairier and less settled.
MZMcBride
On Mon, 6 Mar 2017 at 16:52 MZMcBride z@mzmcbride.com wrote:
For a "regular" MediaWiki installation, will making these changes be a matter of simply updating MediaWiki's application code and running maintenance/update.php?
Yes.
For Wikimedia wikis, as far as I know update.php is never run.
Correct; it'd take down the cluster.
Are you planning to write separate maintenance scripts for this?
Yes.
As is "normal" with schema changes, in Wikimedia production this will be done manually by the DBAs https://wikitech.wikimedia.org/wiki/Schema_changes. It is a careful, very slow process that manages the otherwise-impossible. It will take months of their time, is seriously laborious, and blocks any other such changes. A recent user-facing example is T69223 https://phabricator.wikimedia.org/T69223, which was required to support translation from non-English languages on multi-content wikis. This is why the DBAs' views are so important. :-)
Once the schema change is done, we may/will back-fill old rows to populate the new schema, using maintenance scripts for each wiki. However, given that the table we're talking about is revision, with over three quarters of a billion rows on enwiki alone, that will be exceptionally slow-running.
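To make the back-fill step concrete, here is a small sketch in Python with sqlite3 of populating a new column in batches. It is not the actual maintenance script: the schema (an old `rev_comment` text column alongside a new, initially NULL `rev_comment_id`), the tiny batch size, and the sample rows are all assumptions chosen for illustration.

```python
import sqlite3

def backfill_comments(conn, batch_size=2):
    """Populate rev_comment_id for old rows, one small batch at a time."""
    batches = 0
    while True:
        rows = conn.execute(
            "SELECT rev_id, rev_comment FROM revision "
            "WHERE rev_comment_id IS NULL ORDER BY rev_id LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return batches
        for rev_id, text in rows:
            # Reuse the comment row if this text was already stored.
            conn.execute(
                "INSERT OR IGNORE INTO comment (comment_text) VALUES (?)", (text,)
            )
            comment_id = conn.execute(
                "SELECT comment_id FROM comment WHERE comment_text = ?", (text,)
            ).fetchone()[0]
            conn.execute(
                "UPDATE revision SET rev_comment_id = ? WHERE rev_id = ?",
                (comment_id, rev_id),
            )
        # Commit per batch so no single transaction holds many rows.
        conn.commit()
        batches += 1

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comment (
        comment_id INTEGER PRIMARY KEY,
        comment_text TEXT NOT NULL UNIQUE
    );
    CREATE TABLE revision (
        rev_id INTEGER PRIMARY KEY,
        rev_comment TEXT NOT NULL,
        rev_comment_id INTEGER REFERENCES comment(comment_id)
    );
""")
for summary in ["typo", "revert", "typo", "typo", "revert"]:
    conn.execute("INSERT INTO revision (rev_comment) VALUES (?)", (summary,))
conn.commit()

batches_run = backfill_comments(conn, batch_size=2)
remaining = conn.execute(
    "SELECT COUNT(*) FROM revision WHERE rev_comment_id IS NULL"
).fetchone()[0]
distinct_comments = conn.execute("SELECT COUNT(*) FROM comment").fetchone()[0]
```

Because each pass only selects rows that are still NULL, the loop can be interrupted and restarted without redoing finished work, which matters for a run expected to be this slow.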
Once all *that* is done, we could do a further schema change to drop the old bits of the schema that are no longer used (again, slow), and then drop the backwards-compatible database code from MediaWiki. But that's optional.
Regarding scope, this is a lot of changes. How are all of these changes intended to be divided? Are we able to move forward with some changes (e.g., adding a comment table) without moving forward with other changes (e.g., adding a user_entry table)?
Yes, but given that this round will take years to complete, deciding to delay some of the things means upsetting a lot of plans.
J.
Summary from March 8 irc meeting: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.201...
-- brion
wikitech-l@lists.wikimedia.org