On Nov 15, 2012, at 2:51 PM, MZMcBride <z(a)mzmcbride.com> wrote:
Max Semenik wrote:
> On 15.11.2012, 4:06 Diederik wrote:
>> I think that the Analytics team would prefer either:
>> 1) detect source of edit in the URL
>> Or
>> 2) have a hook activated after a successful edit and have the data send to
>> the pixel service
>
>> Having this data in a MySQL table poses a lot of challenges with
>> respect of importing that data into the analytics cluster
>
> That's for analytics purposes. However, there can be other use cases
> for which tags in the DB are perfect, for example filter recent
> changes for edits made only via a particular channel.
Max is right.
The general issue is that the revision table could use a generalized metadata store, the
same way the page table has page_props[1]. This is not the same as analytical needs, but
it sometimes coincides with them (I assume that if we come up with a way to attach
revision-based metadata, it would be easy to expose that same data to the analytics
pipeline for RevTagging).
As for Amir's original suggestion, I think that hacking a rev_mobile field into the
revision table sounds extremely clunky. I'd be worried that over time this will end up
in an explosion of columns that resembles our Recentchanges table[2]. I assume that's why
Amir brought this up as a suggestion.
On the other hand, a "revision_props" table modeled on page_props would be a terrible
waste of space and performance. Imagine storing a boolean in a BLOB with every revision?
Perhaps we could use a small varchar or smallint in place of the BLOB, which would not be
too high-impact but still flexible enough to handle both existing (key: mobile_edit;
value: 1) and future needs? Especially if mobile_edit=0 isn't actually stored as an entry
at all.
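A minimal sketch of that sparse key/value idea, using Python's sqlite3 module purely for illustration. The table name revision_props, its rp_* columns, and the helper functions are hypothetical; nothing like this exists in MediaWiki, and a real schema would target MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_comment TEXT);
-- Hypothetical table: like page_props, but with a SMALLINT instead of a BLOB.
CREATE TABLE revision_props (
    rp_rev_id   INTEGER     NOT NULL,
    rp_propname VARCHAR(60) NOT NULL,
    rp_value    SMALLINT    NOT NULL,
    PRIMARY KEY (rp_rev_id, rp_propname)
);
""")


def save_revision(rev_id, comment, props=None):
    """Insert a revision and store only its non-default (truthy) props."""
    conn.execute("INSERT INTO revision VALUES (?, ?)", (rev_id, comment))
    for name, value in (props or {}).items():
        if value:  # mobile_edit=0 is the implicit default, so no row is written
            conn.execute(
                "INSERT INTO revision_props VALUES (?, ?, ?)",
                (rev_id, name, value),
            )


def get_prop(rev_id, name, default=0):
    """Read a prop, treating a missing row as the default value."""
    row = conn.execute(
        "SELECT rp_value FROM revision_props"
        " WHERE rp_rev_id = ? AND rp_propname = ?",
        (rev_id, name),
    ).fetchone()
    return row[0] if row else default


save_revision(1, "desktop edit", {"mobile_edit": 0})
save_revision(2, "mobile edit", {"mobile_edit": 1})
save_revision(3, "no metadata at all")
```

Only revision 2 costs a row here; the other two read back as mobile_edit=0 without storing anything.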
Right, which is why a revision tagging system exists in MediaWiki core
currently. If someone wanted to, for example, modify the MobileFrontend
extension to add a "mobile" tag to edits, it would be trivial to do. The
tagging infrastructure is already in place.
It's unfortunate that RevTagging got mixed into this discussion, but I hope this
clarifies the distinction between mobile's needs and RevTagging.
Currently, MW has a very limited ability to attach metadata to the revision table, in
the form of new cols on that table (existing cols are… limited[3]). The issue is that
this data is prioritized for transactional use, not necessarily analytical use (in the
wiki[4]: "is needed to operate the website and, in particular, to populate article
revision histories").
In analytical systems, data is fed down a different pipeline in order to be
"online" and have no impact on the web transactions. Naïvely, that's because
analytical questions on transactional databases look like "COUNT * FROM
sometable", which are full table scans (or thereabouts) and are expensive. Adding the
metadata for analytical purposes based on the OLTP store would then be "COUNT * FROM
sometable GROUP BY datafromothertable JOIN awholemessoftables", which are multiple
full table scans, and pretty soon that would require a dedicated offline read-only DB
and still be terribly slow.
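To make the shape of those two queries concrete, here is a small sketch using Python's sqlite3 module. The revision_meta table and its rm_* columns are hypothetical, the data is synthetic, and SQLite's planner is not MySQL's, so the printed plans only illustrate the kind of work involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
-- Hypothetical side table holding per-revision analytics metadata.
CREATE TABLE revision_meta (rm_rev_id INTEGER, rm_campaign TEXT);
""")
conn.executemany(
    "INSERT INTO revision VALUES (?, ?)",
    [(i, i % 10) for i in range(1, 1001)],
)
conn.executemany(
    "INSERT INTO revision_meta VALUES (?, ?)",
    [(i, "campaign_a" if i % 4 == 0 else "campaign_b") for i in range(1, 1001)],
)

# The simple analytical question: one pass over the table.
SIMPLE = "SELECT COUNT(*) FROM revision"

# The same question broken down by metadata: a join plus a grouping step.
GROUPED = """
SELECT m.rm_campaign, COUNT(*)
FROM revision AS r
JOIN revision_meta AS m ON m.rm_rev_id = r.rev_id
GROUP BY m.rm_campaign
"""


def plan(sql):
    """Return the 'detail' column of SQLite's query plan for a statement."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


total = conn.execute(SIMPLE).fetchone()[0]
per_campaign = dict(conn.execute(GROUPED).fetchall())

for label, sql in (("simple", SIMPLE), ("grouped", GROUPED)):
    print(label, plan(sql))
```

Every extra table joined in adds another step to the plan, and on a live transactional database each of those steps competes with the web requests.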
So there is a need to attach metadata needed for analytics (which may or may not be the
same metadata "needed to operate the website") at runtime so that it can be run
down the analytical data pipeline without needing to hit the live OLTP store continually
asking things like "give me the campaign that this revision occurred under?"
especially when things like "campaign" probably have no importance at all to the
website itself.
My thinking is that if we had a way of attaching arbitrary meta to revisions, then, in
cases where the two needs coincide, all we have to do is expose that same meta to
analytics through their pixel service (revtagging) and we're good to go. If revtagging
isn't up, or hasn't recorded it, we could still go back to the transactional store
offline and backfill the missing information.
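A rough sketch of that flow in Python, under stated assumptions: MetaStore stands in for the transactional store, the emit callback stands in for the pixel service, and on_edit_saved, backfill, and the campaign values are all hypothetical names made up for illustration. None of this is an existing MediaWiki or analytics API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class EditEvent:
    rev_id: int
    meta: Dict[str, str]


@dataclass
class MetaStore:
    """Stand-in for the transactional store of revision metadata."""
    rows: Dict[int, Dict[str, str]] = field(default_factory=dict)

    def attach(self, rev_id: int, meta: Dict[str, str]) -> None:
        self.rows[rev_id] = dict(meta)


Emit = Callable[[EditEvent], None]


def on_edit_saved(rev_id: int, meta: Dict[str, str],
                  store: MetaStore, emit: Emit) -> bool:
    """Attach meta to the revision, then try to expose it to analytics."""
    store.attach(rev_id, meta)  # the transactional write always happens
    try:
        emit(EditEvent(rev_id, dict(meta)))
        return True
    except ConnectionError:
        return False  # analytics missed it; the store can fill it in later


def backfill(store: MetaStore, already_seen: Iterable[int],
             emit: Emit) -> List[int]:
    """Offline pass: re-send meta for revisions analytics never received."""
    missing = sorted(set(store.rows) - set(already_seen))
    for rev_id in missing:
        emit(EditEvent(rev_id, dict(store.rows[rev_id])))
    return missing


received: List[EditEvent] = []


def pixel_emit(event: EditEvent) -> None:
    received.append(event)


def down_emit(event: EditEvent) -> None:
    raise ConnectionError("pixel service unavailable")


store = MetaStore()
first_ok = on_edit_saved(1, {"campaign": "example_a"}, store, pixel_emit)
second_ok = on_edit_saved(2, {"campaign": "example_b"}, store, down_emit)
filled = backfill(store, {e.rev_id for e in received}, pixel_emit)
```

The edit path never waits on or fails because of analytics; the backfill is the only step that reads the transactional store, and it can run offline.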
Going back to the broader point, I'm completely lost as to why the Analytics
team can't handle a structured database.
I assume this last is a bit tongue-in-cheek, but I LOL'd… for completely different
reasons.
[1]: http://www.mediawiki.org/wiki/Manual:Page_props_table
[2]: http://www.mediawiki.org/wiki/Manual:Recentchanges_table
[3]: http://www.mediawiki.org/wiki/Manual:Revision_table
[4]: http://www.mediawiki.org/wiki/Revtagging
terry chay 최태리
Director of Features Engineering
Wikimedia Foundation
“Imagine a world in which every single human being can freely share in the sum of all
knowledge. That's our commitment.”
p: +1 (415) 839-6885 x6832
m: +1 (408) 480-8902
e: tchay(a)wikimedia.org
i: http://terrychay.com/
w: http://meta.wikimedia.org/wiki/User:Tychay
aim: terrychay