On Dec 5, 2012, at 12:43 PM, Patrick Reilly <preilly(a)wikimedia.org> wrote:
Fellow Wikimedia Developers,
Matthias Mullie has been working hard to refactor the backend of
mediawiki/extensions/ArticleFeedbackv5 to add proper sharding support.
The original approach that he took was to rely on RDBStore that was
first introduced in Change-Id:
Ic1e38db3d325d52ded6d2596af2b6bd3e9b870fe
https://gerrit.wikimedia.org/r/#/c/16696 by Aaron Schulz.
Asher Feldman, Tim Starling and I reviewed the new class RDBStore
and determined that it wasn't really the best approach for our current
technical architecture and database environment. Aaron Schulz had a
lot of really good ideas included in RDBStore, but it just seemed like
it wasn't a great fit right now. We decided collectively to abandon
the RDBStore work at this time.
:-( I'm going through all the stages of grief right now. In a few moments, I'll
hit "acceptance"
So, we're now left with the need to provide
Matthias Mullie with some
direction on what is the best solution for the ArticleFeedbackv5
refactor.
One possible solution would be to create a new database cluster for
this type of data. This cluster would be solely for data that is
similar to Article Feedback's and that has the potential of being
spammy in nature. The MediaWiki database abstraction layer could be
used directly via a call to the wfGetDB() function to retrieve a
Database object. One read limitation of this approach will be
particularly evident when we require a complex join: we will need to
eliminate any cross-shard joins.
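Eliminating a cross-shard join in practice means issuing one query per cluster and joining the results in application code. Below is a minimal sketch in Python; the table and column names (aft_feedback, af_page_id, page_title) and the select() helper are illustrative stand-ins, not the actual AFTv5 schema or MediaWiki Database API:

```python
def select(rows, key, wanted):
    """Minimal stand-in for a per-cluster SELECT ... WHERE key IN (...)."""
    return [r for r in rows if r[key] in wanted]

def fetch_feedback_with_titles(feedback_table, page_table, page_ids):
    # Query 1: hit the dedicated "spammy data" cluster for feedback rows.
    feedback = select(feedback_table, "af_page_id", page_ids)
    # Query 2: hit the core cluster for page titles, keyed by page_id.
    titles = {p["page_id"]: p["page_title"]
              for p in select(page_table, "page_id", page_ids)}
    # The join happens in application code, never across clusters in SQL.
    return [dict(row, page_title=titles[row["af_page_id"]])
            for row in feedback]
```

Two round trips instead of one JOIN, but neither query ever needs both clusters at once.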
This seems like the only reasonable solution that can be done in a timely manner at the
moment.
I caution against making this sort of vertical partitioning a long-term solution.
Knowledge of which data lives on which machine is the sort of human knowledge, created by
heterogeneous systems, that is prone to failure. All "social"-style data tends to
proliferate fast, and ArticleFeedback is just one such piece. Tying ourselves to a vertical
partition that grows at Moore's-law rates is going to bite us when we hit things like Flow
(messages), LQT3, etc.
Cross-shard JOINs are already eliminated (or should be) in the AFTv5 patch, since it assumes
RDBStore, so there shouldn't be a call in the code that requires wfGetDB() to return the
same database object for AFT-related tables and non-AFT-related ones.
The reality is that database sharding is a very useful
technology, but,
as with other approaches, there are many factors to consider to ensure
a successful implementation. Further, it has limitations, and
database sharding will not work well for every type of application.
Most of this is alleviated with increased dependence on memcached for caching intermediate
values and rollups. Since this isn't handled at the object level in MediaWiki, I
assume this is a problem for the AFTv5 patch and not RDBStore.
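The caching of intermediate values and rollups described here is essentially the cache-aside pattern: compute an aggregate once, serve repeat reads from memcached, and invalidate on write. A minimal sketch in Python, with a plain dict standing in for memcached and a hypothetical compute callback standing in for the expensive cross-shard aggregation:

```python
class RollupCache:
    """Cache-aside wrapper around an expensive rollup computation."""

    def __init__(self, compute):
        self._cache = {}         # stand-in for memcached
        self._compute = compute  # expensive aggregation over the shards
        self.misses = 0

    def get(self, key):
        if key not in self._cache:       # cache miss: do the real work
            self.misses += 1
            self._cache[key] = self._compute(key)
        return self._cache[key]          # cache hit: no shard access

    def invalidate(self, key):
        # On write, drop the rollup so the next read recomputes it.
        self._cache.pop(key, None)
```

A real memcached deployment adds TTLs and eviction, but the read/invalidate flow is the same.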
So, at the point when we truly implement sharding in
the future, it
will more than likely be beneficial to focus on places in core
MediaWiki where it will have the greatest impact, such as the
pagelinks and revision tables.
Yes, it'd have the greatest impact there, but these are the tables with the heaviest use of
indexes, rollups, and joins.
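For context, sharding a table like revision would mean routing each row to one of N servers by a stable hash of a shard key; choosing page id as that key is an assumption for illustration only:

```python
import zlib

def shard_for(page_id, num_shards=4):
    # A stable (non-randomized) hash keeps routing consistent across
    # processes and restarts, unlike Python's built-in hash().
    return zlib.crc32(str(page_id).encode()) % num_shards

# All revisions of one page land on one shard, so per-page history is a
# single-shard query. Any query spanning many pages (or any JOIN against
# an unsharded table) must fan out to every shard and merge in the
# application -- which is exactly why join-heavy tables are hard to shard.
```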
To do this properly we'd need a core object base class that would allow us to use
BagOStuff/memcached as a passthrough on anything going to the shard, so that joins are
done in PHP and reads come only from memcached (unless the data requested has been LRU'd).
That's a much larger undertaking.
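As a rough sketch of that passthrough idea, assuming a BagOStuff-like key-value interface: writes go through to the shard and populate the cache, while reads hit the cache first and fall back to the shard only when an entry has been evicted (LRU'd). The ShardedStore class and its dict-backed storage are hypothetical stand-ins, not an actual MediaWiki class:

```python
class ShardedStore:
    """Write-through store: cache in front, sharded storage behind."""

    def __init__(self, shard_db, cache):
        self._db = shard_db   # authoritative sharded storage
        self._cache = cache   # memcached stand-in (entries may be evicted)

    def set(self, key, value):
        self._db[key] = value      # write through to the shard...
        self._cache[key] = value   # ...and keep the cache warm.

    def get(self, key):
        try:
            return self._cache[key]   # normal read path: cache only
        except KeyError:
            value = self._db[key]     # entry was LRU'd: refill from shard
            self._cache[key] = value
            return value
```

Joins over such stores happen in application code, as in the earlier point about AFTv5 -- hence the "much larger undertaking."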
Take care,
terry