On Tue, Aug 11, 2015 at 12:29 PM, Jon Katz <jkatz(a)wikimedia.org> wrote:
> However, it seems that >90% of the clicks are coming from the article
> table (or adding search created bloat) and
> MobileWebUIClickTracking_10742159 is now approaching 300 GB. Mostly
> this is due to search. I would encourage further sampling, but that
> would mean that beta data would be lost. Perhaps we can split it into
> separate beta/stable tables and then sample stable? Any other ideas?
1. Add a samplingRatio field to the schema.
2. Add a PHP global to control the sampling ratio, and set it via
   operations/mediawiki-config appropriately for each site.
3. In the SQL queries used for the dashboards, replace count(*) with
   sum(event_samplingRatio).

We did that for MediaViewer and it worked great.
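To make the count(*) → sum(event_samplingRatio) trick concrete, here is a minimal sketch using an in-memory SQLite table. Table and column names are illustrative only, not the real schema; the point is just that a row logged at ratio 1:N stands in for N events:

```python
import sqlite3

# Toy stand-in for an EventLogging table with a per-row sampling ratio.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MobileWebUIClickTracking "
    "(event_name TEXT, event_samplingRatio INTEGER)"
)

# Two events logged unsampled (ratio 1, e.g. beta) and three events
# logged at 1:100 (e.g. stable):
rows = [("search", 1), ("search", 1),
        ("search", 100), ("article", 100), ("article", 100)]
conn.executemany(
    "INSERT INTO MobileWebUIClickTracking VALUES (?, ?)", rows
)

# count(*) reports only the rows actually stored...
stored = conn.execute(
    "SELECT count(*) FROM MobileWebUIClickTracking"
).fetchone()[0]

# ...while sum(event_samplingRatio) estimates the true event volume,
# since each stored row represents samplingRatio real events.
estimated = conn.execute(
    "SELECT sum(event_samplingRatio) FROM MobileWebUIClickTracking"
).fetchone()[0]

print(stored, estimated)  # 5 302
```

The dashboards stay correct across sites with different sampling rates because the weighting travels with each event.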
Also, if your main concern is table size (for us it was mainly server
load), you can just run a script periodically to replace the user agent
and the URL with an empty string. Those two fields probably take up most
of the storage space; every other field is fairly short.
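A rough sketch of what that periodic trim could look like, again against an in-memory SQLite table with made-up column names (userAgent, url, timestamp are assumptions, not the real schema) — blank out the bulky text fields on rows older than a cutoff:

```python
import sqlite3

# Toy table with the two large text columns the trim targets.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MobileWebUIClickTracking "
    "(timestamp TEXT, userAgent TEXT, url TEXT)"
)
conn.executemany(
    "INSERT INTO MobileWebUIClickTracking VALUES (?, ?, ?)",
    [
        ("20150601000000", "Mozilla/5.0 (long UA string)",
         "https://en.m.wikipedia.org/wiki/Foo"),
        ("20150810000000", "Mozilla/5.0 (long UA string)",
         "https://en.m.wikipedia.org/wiki/Bar"),
    ],
)

# Replace the large fields with empty strings for old rows only,
# keeping recent rows intact for debugging.
conn.execute(
    "UPDATE MobileWebUIClickTracking "
    "SET userAgent = '', url = '' "
    "WHERE timestamp < '20150701000000'"
)

remaining = conn.execute(
    "SELECT userAgent FROM MobileWebUIClickTracking ORDER BY timestamp"
).fetchall()
print(remaining)  # [('',), ('Mozilla/5.0 (long UA string)',)]
```

The counts and the short fields survive, so the dashboards keep working while the table stops growing so fast.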