+1 to Dario's mention of the many schemas that just capture production DB
stuff in a better way.
Re. growth: Old growth experiment schemas continue to be a great resource
for checking old work and sometimes even new hypotheses. When Dario and
Kevin get around to us, I'll have a complete list of schemas that should
not be purged.
Re. storage parameters in the Schema, I agree with Ori, but I'd still like
to have them on the wiki somehow. If we were a bunch of Wikipedia editors,
I'd suggest making a template for the talk page of a schema that captures
this metadata. A template probably isn't the best fit here, though, and
we'd likely want to stick to JSON, so maybe a subpage would be in order.
- Schema:Foo == data type JSON
- Schema:Foo/restrictions == storage restrictions JSON (sampling,
pruning, indexing, etc.)
- Schema_talk:Foo == Discussion of Schema:Foo
Such a pattern would allow for changes to storage restrictions without
changing the rev_id of the schema page (data type).
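
To make the subpage idea a bit more concrete, here is a rough Python sketch. It is only an illustration of the layout above: the helper names, the keys inside the restrictions JSON, and the rev_id value are all made up, and nothing here reflects what EventLogging actually supports.

```python
import json


def schema_pages(name):
    """Return the wiki page titles for a schema under the proposed layout."""
    return {
        "data_type": f"Schema:{name}",
        "restrictions": f"Schema:{name}/restrictions",
        "discussion": f"Schema_talk:{name}",
    }


# Hypothetical content of Schema:Foo/restrictions.
# The keys (sampling, pruning, indexing) come from the list above;
# the structure and values underneath them are invented for illustration.
RESTRICTIONS_V1 = json.loads("""
{
    "sampling": {"rate": 0.1},
    "pruning": {"max_age_days": 90},
    "indexing": ["timestamp"]
}
""")


def storage_config(schema_rev_id, restrictions):
    """Pair a schema revision with separately maintained restrictions."""
    return {"schema_rev_id": schema_rev_id, "restrictions": restrictions}


# Placeholder rev_id of Schema:Foo (not a real revision).
FOO_REV_ID = 1234567

before = storage_config(FOO_REV_ID, RESTRICTIONS_V1)

# Editing the /restrictions subpage changes only the restrictions;
# the rev_id of the data type page stays the same.
restrictions_v2 = dict(RESTRICTIONS_V1, pruning={"max_age_days": 30})
after = storage_config(FOO_REV_ID, restrictions_v2)
```

The point is just that `before` and `after` share the same schema rev_id even though the pruning rule changed, so clients validating against Schema:Foo would not notice the edit.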
On Thu, May 29, 2014 at 1:26 AM, Steven Walling <swalling(a)wikimedia.org> wrote:
On Wed, May 28, 2014 at 10:50 AM, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
I just announced this potential change in Scrum of Scrums and the Mobile
team said they also would like to keep old data, but not for all of their
schemas. They're cleaning up their graphs and we should check with them
when we start deleting.
Following up on this from the Growth perspective...
My main question is what the rationale is. Is it to improve query
performance on analytics dbs?
I do know there are many older schemas for Growth-related experiments that
are only really useful for historical analysis, which is kind of hard to
reconstruct anyway. If there are sound technical reasons to chuck stuff
from the relational dbs and retain it only in the raw JSON logs, then I'm
potentially okay with helping figure out a list of schemas to retain and
schemas to purge. Aaron, thoughts?