Fellow Wikimedia Developers,
Matthias Mullie has been working hard to refactor the backend of mediawiki/extensions/ArticleFeedbackv5 to add proper sharding support.
The original approach that he took was to rely on RDBStore, which was first introduced by Aaron Schulz in Change-Id: Ic1e38db3d325d52ded6d2596af2b6bd3e9b870fe (https://gerrit.wikimedia.org/r/#/c/16696).
Asher Feldman, Tim Starling, and I reviewed the new RDBStore class and determined that it wasn't really the best approach for our current technical architecture and database environment. Aaron Schulz included a lot of really good ideas in RDBStore, but it just didn't seem like a great fit right now. We collectively decided to abandon the RDBStore work for the foreseeable future.
So, we're now left with the need to provide Matthias Mullie with some direction on what is the best solution for the ArticleFeedbackv5 refactor.
One possible solution would be to create a new database cluster for this type of data. This cluster would be solely for data that is similar to Article Feedback's and that has the potential of being spammy in nature. The MediaWiki database abstraction layer could be used directly via a call to the wfGetDB() function to retrieve a Database object. The main read limitation of this approach will become evident whenever we require a complex join: we will need to eliminate any cross-shard joins.
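For illustration, a minimal sketch of what such a direct call might look like (the table and column names here are hypothetical placeholders, not the extension's actual schema):

    // Minimal sketch only; 'aft_feedback' and its af_* columns are
    // invented names, not AFT's real schema.
    $dbr = wfGetDB( DB_SLAVE );
    $res = $dbr->select(
        'aft_feedback',
        array( 'af_id', 'af_comment' ),
        array( 'af_page_id' => $pageId ),
        __METHOD__
    );
    foreach ( $res as $row ) {
        // ... work with $row->af_comment ...
    }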
The reality is that database sharding is a very useful technology, but, like other approaches, there are many factors to consider to ensure a successful implementation. Further, there are some limitations, and database sharding will not work well for every type of application.
So, to that point: when we truly implement sharding in the future, it will more than likely be beneficial to focus on the places in core MediaWiki where it will have the greatest impact, such as the pagelinks and revision tables.
— Patrick
I'm seconding that recommendation, to be clear. More specifically, I'd suggest that the AFT classes have two new protected methods:

* getSlaveDB() - a wrapper for wfGetLBFactory()->getExternalLB( $wgArticleFeedBackCluster )->getConnection( DB_SLAVE, array(), $wikiId )
* getMasterDB() - a wrapper for wfGetLBFactory()->getExternalLB( $wgArticleFeedBackCluster )->getConnection( DB_MASTER, array(), $wikiId )

The wrappers could also handle the case where the cluster is the usual wiki cluster (e.g. good old wfGetDB()).
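A rough sketch of what those wrappers might look like, assuming $wgArticleFeedBackCluster is false when the data lives on the usual wiki cluster, and a $wikiId already in scope (exact placement in the AFT class hierarchy is open):

    // Sketch only, following the outline above.
    protected function getSlaveDB( $wikiId ) {
        global $wgArticleFeedBackCluster;
        if ( $wgArticleFeedBackCluster === false ) {
            return wfGetDB( DB_SLAVE ); // good old wiki cluster
        }
        return wfGetLBFactory()
            ->getExternalLB( $wgArticleFeedBackCluster )
            ->getConnection( DB_SLAVE, array(), $wikiId );
    }

    protected function getMasterDB( $wikiId ) {
        global $wgArticleFeedBackCluster;
        if ( $wgArticleFeedBackCluster === false ) {
            return wfGetDB( DB_MASTER );
        }
        return wfGetLBFactory()
            ->getExternalLB( $wgArticleFeedBackCluster )
            ->getConnection( DB_MASTER, array(), $wikiId );
    }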
You could then swap out the current wfGetDB() calls with these methods. It might be easiest to start with the current AFT, do this, and fix up the excessive write queries, rather than try to convert the AFTv5 code that used sharding. The name of the cluster would be an AFT configuration variable (e.g. $wgArticleFeedBackCluster = 'external-aft').
This works by adding the new 'external-aft' cluster to the 'externalLoads' portion of the load balancer configuration. It may make sense to give the cluster a non-AFT-specific name though (like 'external-1'), since I assume other extensions would use it. Maybe the clusters could be named after philosophers to be more interesting...
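Something along these lines in the site configuration, assuming an LBFactory_Multi-style setup (the host names and load weights here are invented for the example):

    // Invented hosts/weights; real values depend on the site's
    // existing $wgLBFactoryConf ('externalLoads' is its key for
    // external clusters under LBFactory_Multi).
    $wgLBFactoryConf['externalLoads']['external-aft'] = array(
        'db1001' => 0,   // master; load 0 keeps reads off it
        'db1002' => 100, // slave
        'db1003' => 100, // slave
    );
    $wgArticleFeedBackCluster = 'external-aft';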
One could instead use wfGetDB( $index, array(), 'extension-aft' ), though this would be a bit of a hack, since:
a) a wiki ID would be used as an external cluster name where there is no wiki
b) the actual wiki IDs would have to go into table names or a column
On Dec 5, 2012, at 12:43 PM, Patrick Reilly <preilly@wikimedia.org> wrote:
> Fellow Wikimedia Developers,
> Matthias Mullie has been working hard to refactor the backend of mediawiki/extensions/ArticleFeedbackv5 to add proper sharding support.
> The original approach that he took was to rely on RDBStore, which was first introduced by Aaron Schulz in Change-Id: Ic1e38db3d325d52ded6d2596af2b6bd3e9b870fe (https://gerrit.wikimedia.org/r/#/c/16696).
> Asher Feldman, Tim Starling, and I reviewed the new RDBStore class and determined that it wasn't really the best approach for our current technical architecture and database environment. Aaron Schulz included a lot of really good ideas in RDBStore, but it just didn't seem like a great fit right now. We collectively decided to abandon the RDBStore work for the foreseeable future.
:-( I'm going through all the stages of grief right now. In a few moments, I'll hit "acceptance."
> So, we're now left with the need to provide Matthias Mullie with some direction on what is the best solution for the ArticleFeedbackv5 refactor.
> One possible solution would be to create a new database cluster for this type of data. This cluster would be solely for data that is similar to Article Feedback's and that has the potential of being spammy in nature. The MediaWiki database abstraction layer could be used directly via a call to the wfGetDB() function to retrieve a Database object. The main read limitation of this approach will become evident whenever we require a complex join: we will need to eliminate any cross-shard joins.
This seems like the only reasonable solution that can be done in a timely manner at the moment.
I caution against making this sort of vertical partitioning a long-term solution. Knowing which data lives on which machine is exactly the sort of human knowledge, created by heterogeneous systems, that is prone to failure. All "social"-style data tends to proliferate fast, and ArticleFeedback is just one such piece. Tying ourselves to a vertical partition that grows at Moore's-law rates is going to bite us when we hit things like Flow (messages), LQT3, etc.
Cross-shard JOINs are already eliminated (or should be) in the AFTv5 patch, since it assumes RDBStore, so there shouldn't be any call in the code that requires wfGetDB() to return the same Database object for AFT-related tables and non-AFT-related ones.
> The reality is that database sharding is a very useful technology, but, like other approaches, there are many factors to consider to ensure a successful implementation. Further, there are some limitations, and database sharding will not work well for every type of application.
Most of this is alleviated by an increased dependence on memcached for caching intermediate values and rollups. Since this isn't handled at the object level in MediaWiki, I assume this is a problem for the AFTv5 patch and not RDBStore.
> So, to that point: when we truly implement sharding in the future, it will more than likely be beneficial to focus on the places in core MediaWiki where it will have the greatest impact, such as the pagelinks and revision tables.
Yes, it'd have the greatest impact there, but these are also the tables with some of the most indexes/rollups/joins.
To do this properly, we'd need a core object base class that would allow us to use BagOStuff/memcached as a passthrough on anything going to the shard, so that joins are done in PHP and reads come only from memcached (unless the data requested has been LRU'd).
That's a much larger undertaking.
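Just to make the idea concrete, a toy sketch of the sort of read-through base class terry is describing (class and method names are invented; this is nowhere near a real design):

    // Toy illustration only; all names here are invented.
    abstract class CachedShardedObject {
        /** @var BagOStuff */
        protected $cache;

        public function __construct( BagOStuff $cache ) {
            $this->cache = $cache;
        }

        // Read-through: serve from the cache, and only touch the
        // shard when the entry has been LRU'd or was never set.
        public function get( $id ) {
            $key = wfMemcKey( get_class( $this ), $id );
            $value = $this->cache->get( $key );
            if ( $value === false ) {
                $value = $this->loadFromShard( $id );
                $this->cache->set( $key, $value );
            }
            return $value;
        }

        // Subclasses fetch a single object from the right shard;
        // anything resembling a cross-shard join is then assembled
        // in PHP from individual get() calls.
        abstract protected function loadFromShard( $id );
    }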
Take care,
terry