Within the Flow extension we need to insert our own special changes into the recentchanges table so that watchlists continue to inform users of changes in the ways they are used to. Within MediaWiki, the Wikidata extension has similar requirements and has implemented a solution that works for its use case. Flow is looking to extend this to handle multiple types of external change sources. The solution taken by Wikidata to render the lines works well and will be used by Flow, but we have some concerns about how different types of external changes will be filtered by the queries that generate the Special:RecentChanges and Special:Watchlist pages.
How does the current solution work?
There is a field in the recentchanges table, rc_type. All Wikidata entries use the value RC_EXTERNAL (= 5) for this field. Queries are generated with either (rc_type = 5) or (rc_type != 5) when filtering is required.
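A minimal sketch of that filter, using Python's sqlite3 in place of MySQL. The table is cut down to three columns and the sample rows are made up for illustration; only the value 5 for RC_EXTERNAL and the (rc_type != 5) condition come from the description above.

```python
import sqlite3

RC_EDIT = 0      # ordinary core edit
RC_EXTERNAL = 5  # value quoted above for external (Wikidata) rows


def demo_connection():
    """Build a tiny in-memory stand-in for recentchanges (illustrative rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE recentchanges ("
        " rc_timestamp TEXT NOT NULL,"
        " rc_title TEXT NOT NULL,"
        " rc_type INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO recentchanges VALUES (?, ?, ?)",
        [
            ("20130912000001", "Main_Page", RC_EDIT),
            ("20130912000002", "Q42", RC_EXTERNAL),
            ("20130912000003", "Sandbox", RC_EDIT),
        ],
    )
    return conn


def fetch_titles(conn, hide_external):
    """Return titles newest first, optionally dropping external rows."""
    sql = "SELECT rc_title FROM recentchanges"
    params = ()
    if hide_external:
        # Same shape as the (rc_type != 5) condition in the real queries.
        sql += " WHERE rc_type != ?"
        params = (RC_EXTERNAL,)
    sql += " ORDER BY rc_timestamp DESC"
    return [row[0] for row in conn.execute(sql, params)]
```

Because every external source shares the single value 5, this toggle can only hide or show all of them together, which is the limitation the options below try to address.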
Requirements:
- Currently Wikidata entries in recentchanges are filtered from Special:RecentChanges and Special:Watchlist. This is toggleable. By default we will not want to filter Flow entries, but we will want to offer a toggle much like Wikidata does.
- More types of external change sources should be able to add themselves in the future without core changes.
- We should play nice with the db slaves serving up watchlists.
There are a couple options, each with their own tradeoffs.
1. Use rc_type = RC_EXTERNAL and add a new field to the recentchanges table, rc_external_type. This would be a varchar(16) field. Wikidata and Flow would put their respective names in the field to distinguish between each other. This is conceptually simple, but makes the queries look even odder: (rc_type != 5) becomes (rc_type != 5 AND rc_external_type != 'wikidata').
2. Similar to 1, but instead of creating a new field, reuse the rc_log_type field, which is only used when rc_type = RC_LOG. This seems a bit hacky, but would only need a field rename to not feel so hacky. I'm not proposing to rename the field, though, as there are a variety of extensions depending on the current field name and we are not going to coordinate getting them all updated at the exact same time. The fact that this field is used by various extensions may be a hint that we shouldn't reuse it.
3. Replace RC_EXTERNAL with RC_WIKIDATA and RC_FLOW constants in their respective extensions. This is also straightforward, but adds development overhead to ensure future creators of RC_* constants do not conflict with each other. It would be handled similarly to NS_* constants, with an on-wiki list. I have heard some mention that naming conflicts have occurred in the past with this solution. This would force queries looking for only core sources of change to provide an inclusive list of RC_* values to find, rather than using rc_type != RC_EXTERNAL.
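To make option 1 concrete, here is a rough sketch of a per-source toggle, again using sqlite3 rather than MySQL. It assumes rc_external_type is NOT NULL with an empty-string default for core rows; that default, the helper names and the sample titles are illustrative choices and not part of the proposal.

```python
import sqlite3

RC_EDIT = 0
RC_EXTERNAL = 5


def option1_demo():
    """Table with the extra rc_external_type column from option 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE recentchanges ("
        " rc_id INTEGER PRIMARY KEY,"
        " rc_title TEXT NOT NULL,"
        " rc_type INTEGER NOT NULL,"
        # Assumption: core rows carry '' rather than NULL.
        " rc_external_type TEXT NOT NULL DEFAULT '')"
    )
    conn.executemany(
        "INSERT INTO recentchanges (rc_title, rc_type, rc_external_type)"
        " VALUES (?, ?, ?)",
        [
            ("Main_Page", RC_EDIT, ""),
            ("Q42", RC_EXTERNAL, "wikidata"),
            ("Topic:Abc", RC_EXTERNAL, "flow"),  # made-up Flow title
        ],
    )
    return conn


def visible_titles(conn, hidden_sources):
    """Titles that remain after hiding the named external sources."""
    sql = "SELECT rc_title FROM recentchanges"
    params = list(hidden_sources)
    if params:
        placeholders = ", ".join("?" for _ in params)
        sql += f" WHERE rc_external_type NOT IN ({placeholders})"
    sql += " ORDER BY rc_id"
    return [row[0] for row in conn.execute(sql, params)]
```

The empty-string default matters here: if the column were nullable, comparing NULL against 'wikidata' would yield NULL and the core rows would silently drop out of the result.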
Things to consider: On smaller wikis, Wikidata changes can account for > 50% of the changes. Talk namespace edits, which we expect to eventually replace with Flow edits, account for ~20% of enwiki recentchanges rows.
The standard query issued by Special:RecentChanges is
SELECT /* lots of fields */ FROM `recentchanges` FORCE INDEX (rc_timestamp) LEFT JOIN `watchlist` ON (wl_user = '2' AND (wl_title=rc_title) AND (wl_namespace=rc_namespace)) LEFT JOIN `tag_summary` ON ((ts_rc_id=rc_id)) WHERE (rc_timestamp >= '20130912000000') AND rc_bot = '0' AND (rc_type != 5) ORDER BY rc_timestamp DESC LIMIT 50
The standard query issued by Special:Watchlist is
SELECT /* lots of fields */ FROM `recentchanges` INNER JOIN `watchlist` ON (wl_user = '2' AND (wl_namespace=rc_namespace) AND (wl_title=rc_title)) LEFT JOIN `page` ON ((rc_cur_id=page_id)) LEFT JOIN `tag_summary` ON ((ts_rc_id=rc_id)) WHERE (rc_timestamp > '20130916175626') AND (rc_this_oldid=page_latest OR rc_type=3) AND (rc_type != 5) ORDER BY rc_timestamp DESC
Without further input I will be implementing option 3 from above. I welcome any input on better solutions or potential pitfalls with this solution.
Erik Bernhardson
Do any of the 3 options keep the same problem as https://bugzilla.wikimedia.org/show_bug.cgi?id=44874 from hitting us? Users can ignore Wikidata changes in exchange for efficiency (enhanced RC), but I understand you don't want them to ignore Flow.
Nemo
Unfortunately no, none of this has anything to do specifically with fixing the spaghetti that is the enhanced changes format. I have not looked deeply into the problem, but the comments from the Wikidata developers who have looked into it suggest it is a non-trivial change. The change proposed above is very trivial from an implementation perspective, but it affects one of the most used tables in MediaWiki, and the developers I've spoken with have different opinions on which way is the best way to go. I wanted to give those I have not talked to directly an opportunity to be heard before we change anything.
Erik Bernhardson
On Thu, Sep 19, 2013 at 4:11 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Do any of the 3 options keep the same problem as https://bugzilla.wikimedia.org/show_bug.cgi?id=44874 from hitting us? Users can ignore Wikidata changes in exchange for efficiency (enhanced RC), but I understand you don't want them to ignore Flow.
Nemo
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Thu, Sep 19, 2013 at 11:45 AM, Erik Bernhardson < ebernhardson@wikimedia.org> wrote:
- Replace RC_EXTERNAL with RC_WIKIDATA and RC_FLOW constants in their respective extensions. This is also straightforward, but adds development overhead to ensure future creators of RC_* constants do not conflict with each other. It would be handled similarly to NS_* constants with an on-wiki list. I have heard some mention that naming conflicts have occurred in the past with this solution. This would force queries looking for only core sources of change to provide an inclusive list of RC_* values to find, rather than using rc_type != RC_EXTERNAL.
Please don't repeat the mistake of having extension authors actually caring what their namespace number is. Everyone just goes "Oh, nobody's probably using 200 so I'll just do that."
-Chad
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
On 2013-09-19 4:44 PM, Chad wrote:
On Thu, Sep 19, 2013 at 11:45 AM, Erik Bernhardson < ebernhardson@wikimedia.org> wrote:
- Replace RC_EXTERNAL with RC_WIKIDATA and RC_FLOW constants in their respective extensions. This is also straightforward, but adds development overhead to ensure future creators of RC_* constants do not conflict with each other. It would be handled similarly to NS_* constants with an on-wiki list. I have heard some mention that naming conflicts have occurred in the past with this solution. This would force queries looking for only core sources of change to provide an inclusive list of RC_* values to find, rather than using rc_type != RC_EXTERNAL.
Please don't repeat the mistake of having extension authors actually caring what their namespace number is. Everyone just goes "Oh, nobody's probably using 200 so I'll just do that."
-Chad
+1
@Erik The on-wiki list you talk about is here: https://www.mediawiki.org/wiki/Extension_default_namespaces
"I have heard some mention that naming conflicts have occurred in the past with this solution." Yes, there are plenty.
- 120-121 is used by both RefHelper and Rich Media.
- 200-203 is used by SocialProfile and Data Import.
- 300-301 is used by PollNY and Access Control List. Wikia also uses 300-399 when writing its own extensions and doesn't bother co-operating by at least adding the defaults they use to that list to avoid conflicts.
- 500-501 is used by BlogPage and Linked Data.
- 700-701 is used by LinkFilter and Collaboration.
BlueSpice and BlogPage have a different type of conflict too. They BOTH use the constant NS_BLOG and define different namespace defaults for it.
This on-wiki page is ONLY a registry of defaults. The standard practice for these is that the starting number should be configurable, so namespace ids other than the default can be used to avoid conflicts. I'm not so sure you'll be able to do that very well for RC external ids.
Anyways, this whole extension namespace id setup is considered a bug. You don't want to get into this situation again. We have an open bug on dropping this default namespace nonsense and using dynamic registration of namespace IDs https://bugzilla.wikimedia.org/show_bug.cgi?id=31063
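The collisions listed above are easy to show mechanically. This is a small sketch that takes the default ranges exactly as quoted in this message (it does not consult the on-wiki registry) and reports every namespace id claimed by more than one extension; the function and variable names are made up for illustration.

```python
from collections import defaultdict

# Default namespace ranges as quoted in the message above (inclusive bounds).
DEFAULTS = [
    ("RefHelper", 120, 121),
    ("Rich Media", 120, 121),
    ("SocialProfile", 200, 203),
    ("Data Import", 200, 203),
    ("PollNY", 300, 301),
    ("Access Control List", 300, 301),
    ("BlogPage", 500, 501),
    ("Linked Data", 500, 501),
    ("LinkFilter", 700, 701),
    ("Collaboration", 700, 701),
]


def find_conflicts(defaults):
    """Map each id claimed by more than one extension to its claimants."""
    claims = defaultdict(list)
    for name, first, last in defaults:
        for ns_id in range(first, last + 1):
            claims[ns_id].append(name)
    return {ns_id: names for ns_id, names in claims.items() if len(names) > 1}
```

Hardcoded RC_* integers would be open to the same kind of overlap, with the difference noted above that an rc_type value is harder to make configurable per wiki.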
I will take a look over the bug; it is quite a long conversation. It will most likely take me the night to digest the suggestions included. I suppose my first worry is that I was targeting simple changes which can be agreed on and implemented in a few lines, whereas the linked bug report seems to suggest a system that I know will require many iterations and weeks of on/off work before being +2'd into core.
Erik Bernhardson
You can trivially avoid the need to do anything as complex as dynamic namespace registration by simply using one of your other options, such as using the string 'wikidata' or 'flow' rather than a constant and integer id. If you want integer ids that badly, you could always create a new rc_external_types table (or whatever you want to call it) mapping an auto_increment id to keys like 'wikidata' and 'flow', and use the primary key there as the rc_external_type.
Long story short. Hardcoding integer numbers into extensions hoping you're not going to conflict with other extensions is never a good idea. You're just subjecting yourself to future pain you could have avoided at the start with a simple solution.
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
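A minimal sketch of the lookup-table idea from the message above, with sqlite3 standing in for MySQL (AUTOINCREMENT instead of auto_increment). The column names ret_id and ret_key and the helper names are hypothetical; only the table name rc_external_types and the keys 'wikidata' and 'flow' come from the thread.

```python
import sqlite3


def open_demo_db():
    """Create the key-to-id mapping table in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rc_external_types ("
        " ret_id INTEGER PRIMARY KEY AUTOINCREMENT,"  # hypothetical column name
        " ret_key TEXT NOT NULL UNIQUE)"              # hypothetical column name
    )
    return conn


def external_type_id(conn, key):
    """Return the integer id for a source key, registering it on first use."""
    # The UNIQUE constraint makes a repeated insert a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO rc_external_types (ret_key) VALUES (?)", (key,)
    )
    row = conn.execute(
        "SELECT ret_id FROM rc_external_types WHERE ret_key = ?", (key,)
    ).fetchone()
    return row[0]
```

Each wiki hands out its own ids on first use, so no extension has to pick a number in advance; the cost is that ids are not comparable between wikis.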
Strings vs integers is not something I'm too worried about. A varchar(16) is essentially 128 bits, and with all the things a DB has to do I'm not too worried about comparing a 32-bit int vs a 128-bit string. I do feel that adding yet another field to compare against just adds to the complexity of the existing solution without much benefit. After some discussion with the team we have come up with another possible solution (which is much more work, but perhaps worthwhile in the long run).
We are proposing to deprecate the existing rc_type field in the recentchanges table in favor of a new string field, rc_source. We would modify the existing core (and extension, as necessary) code to start inserting into this new field. Greps through MediaWiki and its extensions suggest there are only a handful of places that would need to be transitioned: some work, but not impossible.
DB changes:
ALTER TABLE recentchanges ADD rc_source varchar(16) binary not null;
Changes to constants used:
RC_NEW becomes RC_SRC_NEW
RC_EDIT becomes RC_SRC_EDIT
etc.
define( 'RC_SRC_NEW', 'mw.new' );
define( 'RC_SRC_EDIT', 'mw.edit' );
etc.
Extensions can create their own constants which avoid most coordination that the previous proposal would have needed:
define( 'RC_SRC_FLOW', 'flow.something' );
I'm not certain the db servers would like us going back through time and updating all the recentchanges rows on the various wikis. Most prudent would be for us to get all the insertion points populating the new field and wait a month for all the old data to be deleted from recentchanges. At that point we will mark the existing uses of rc_type as deprecated and start transitioning queries to the new field. I have not previously done this sort of transition in core, so no comment yet on how long we would be populating both fields in the database.
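A rough sketch of the transition period described above, in Python with sqlite3 rather than MediaWiki's PHP database layer. Every insertion point writes both fields, new-style readers filter on rc_source, and legacy readers keep working on rc_type. The strings 'mw.edit', 'mw.new' and 'flow.something' come from the proposal; the helper names, the empty-string default and the choice to keep writing RC_EXTERNAL for Flow rows during the transition are assumptions.

```python
import sqlite3

RC_EDIT, RC_NEW, RC_EXTERNAL = 0, 1, 5   # legacy integer values
RC_SRC_EDIT = "mw.edit"                  # from the proposal
RC_SRC_NEW = "mw.new"                    # from the proposal
RC_SRC_FLOW = "flow.something"           # placeholder string from the proposal


def transition_db():
    """In-memory table carrying both the legacy and the new column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE recentchanges ("
        " rc_id INTEGER PRIMARY KEY,"
        " rc_title TEXT NOT NULL,"
        " rc_type INTEGER NOT NULL,"
        " rc_source TEXT NOT NULL DEFAULT '')"  # assumed default for old rows
    )
    return conn


def record_change(conn, title, rc_type, rc_source):
    """Dual write: keep rc_type for old readers, fill rc_source for new ones."""
    conn.execute(
        "INSERT INTO recentchanges (rc_title, rc_type, rc_source)"
        " VALUES (?, ?, ?)",
        (title, rc_type, rc_source),
    )


def titles_hiding(conn, hidden_sources):
    """Titles in insertion order, skipping rows whose rc_source is hidden."""
    sql = "SELECT rc_title FROM recentchanges"
    params = list(hidden_sources)
    if params:
        sql += " WHERE rc_source NOT IN (%s)" % ", ".join("?" for _ in params)
    return [row[0] for row in conn.execute(sql + " ORDER BY rc_id", params)]
```

Once rows older than the retention window have aged out, every remaining row has a non-empty rc_source and the rc_type readers can be retired.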
Is this approach more reasonable? Caveats?
Erik Bernhardson
On Mon, Sep 23, 2013 at 2:38 PM, Erik Bernhardson ebernhardson@wikimedia.org wrote:
I'm not certain if the db servers would like us going back through time and updating all the recentchanges rows on the various wiki's,
Although it probably won't be used on WMF wikis, you'll still need to provide something of the sort so non-WMF wikis don't wind up with broken recentchanges tables for a month when they upgrade to MediaWiki 1.22 when that is officially released.
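For wikis that cannot wait for old rows to expire, a backfill along these lines is one possibility. This is a sketch in Python with sqlite3, not a MediaWiki updater: it assumes old rows have an empty rc_source, that every legacy rc_type = 5 row is a Wikidata row (as described at the top of the thread), and that 'mw.log' and 'wikidata' are the chosen strings. Only 'mw.edit' and 'mw.new' appear in the proposal.

```python
import sqlite3

# 'mw.edit' and 'mw.new' come from the proposal; 'mw.log' and 'wikidata'
# are assumed names used here for illustration only.
LEGACY_TO_SOURCE = {0: "mw.edit", 1: "mw.new", 3: "mw.log", 5: "wikidata"}


def legacy_db():
    """Sample table: five rows without rc_source, one already populated."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE recentchanges ("
        " rc_id INTEGER PRIMARY KEY,"
        " rc_type INTEGER NOT NULL,"
        " rc_source TEXT NOT NULL DEFAULT '')"
    )
    conn.executemany(
        "INSERT INTO recentchanges (rc_type, rc_source) VALUES (?, ?)",
        [(0, ""), (0, ""), (0, ""), (1, ""), (5, ""), (5, "flow.something")],
    )
    return conn


def backfill_rc_source(conn, batch_size=500):
    """Fill rc_source for legacy rows in small batches; return rows updated."""
    total = 0
    for rc_type, source in LEGACY_TO_SOURCE.items():
        while True:
            cur = conn.execute(
                "UPDATE recentchanges SET rc_source = ? WHERE rc_id IN ("
                " SELECT rc_id FROM recentchanges"
                " WHERE rc_source = '' AND rc_type = ? LIMIT ?)",
                (source, rc_type, batch_size),
            )
            conn.commit()  # commit per batch to keep each transaction short
            if cur.rowcount == 0:
                break
            total += cur.rowcount
    return total
```

Rows that already carry an rc_source are left alone, so the backfill can be re-run safely and can run alongside the new insertion code.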