Hi everyone,
A quick note about something that just messed me up. When uploading a
cohort to wikimetrics, you are told you can use user_name, user_id, or a
mixture of the two in the first column. However, this can really produce
unexpected results if you don't know how it works. I think it needs to
change, but until then, this is how it works and how it can bite you:
Let's say I have a list of users:
1,en
2,en
3,en
When it validates, it will first look up user_name == 1; if it doesn't find
anything, it will look up user_id == 1. Then user_name == 2, user_id == 2,
user_name == 3, user_id == 3. If what you meant with the above cohort was
the users with ids 1, 2, and 3, then you might be very confused later when
you see user id 234215 in your output results. This can happen if some
user's name is literally "2"! So, for now, until I figure out how to fix
this, validation will always prefer user_name matches over user_id matches.
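To make that concrete, here is a rough sketch of the lookup order, using
in-memory stand-ins instead of the real MediaWiki tables. Every name, id,
and function below is made up for illustration; it is not the actual
wikimetrics code:

# Stand-ins for the real user tables; all names and ids here are made up.
ids_by_name = {"2": 234215, "SomeEditor": 1001}      # user_name -> user_id
names_by_id = {1: "Admin", 2: "OtherEditor", 3: "ThirdEditor"}

def resolve(raw_value):
    """Return the user_id a raw CSV value resolves to, preferring user_name."""
    # 1) Treat the value as a user_name first.
    if raw_value in ids_by_name:
        return ids_by_name[raw_value]
    # 2) Only if no such name exists, fall back to treating it as a user_id.
    try:
        user_id = int(raw_value)
    except ValueError:
        return None
    return user_id if user_id in names_by_id else None

for raw in ["1", "2", "3"]:
    print(raw, "->", resolve(raw))
# "1" and "3" fall through to user ids 1 and 3 as intended, but "2" resolves
# to user id 234215 because a user literally named "2" exists.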
Please let me know if this is confusing. Also, the whole problem stems
from needing to accept both user_id and user_name in the *same* upload. If
everyone agrees, I'd much rather just allow people to toggle between one or
the other. This would speed up validation and make it much clearer what is
going on.
Dear Wikimetrics users,
I've just deployed asynchronous cohort upload. This is feature #818
(https://mingle.corp.wikimedia.org/projects/analytics/cards/818), and it
basically allows you to upload larger cohorts because validation happens
behind the scenes. I'll go over how the new functionality works
here, and will rely on one of you to point me to the appropriate on-wiki
place to update documentation.
So basically, visiting /cohorts and clicking "Upload Cohort" works as
before. But once you click "Upload CSV", your form is validated,
processed, and you're taken back to the cohorts page. Your new cohort is
immediately created but is not yet validated. While it validates, you'll
see the validation status and have a few options:
* Remove Cohort. This is destructive and will remove this cohort from your
list. Use this in case you made a mistake, uploaded the wrong file, etc.
* Validate Again. This will run validation again. One possible use: say
you upload a cohort with some *very* newly registered users and, because
of replication lag to the labsdb databases, most of them come up invalid.
Once replication catches up, you can run validation again.
* Refresh. This just refreshes the status of the validation and will
update the counts that show up below.
You will not have the "Create Report" option until validation is done. And
when you do create a report, only valid users will be considered and used
in the output.
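For the curious, the validation itself just runs as a background task with
a time limit (more on that limit below). Here is a minimal sketch of that
shape, assuming a Celery-style task; the names, broker URL, and helper
functions are made up and this is not the actual wikimetrics code:

from celery import Celery

# Hypothetical app name and broker URL, purely for illustration.
queue = Celery('validate_sketch', broker='redis://localhost:6379/0')

@queue.task(time_limit=3600)  # mirrors the 1 hour limit mentioned below
def validate_cohort(cohort_id):
    """Resolve each uploaded row in the background and store its status;
    the web UI just polls the stored valid/invalid counts."""
    valid = invalid = 0
    for raw_value in load_cohort_rows(cohort_id):
        if resolve(raw_value) is None:
            invalid += 1
        else:
            valid += 1
    store_validation_counts(cohort_id, valid, invalid)

def load_cohort_rows(cohort_id):
    # Stand-in for reading the uploaded CSV rows back out of the database.
    return ["1", "2", "3"]

def resolve(raw_value):
    # Stand-in for the user_name-then-user_id lookup from the previous email.
    return raw_value

def store_validation_counts(cohort_id, valid, invalid):
    # Stand-in for persisting the counts that "Refresh" re-reads.
    print(cohort_id, valid, invalid)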
One caveat. Validation is still slow. And the time limit for the
asynchronous task is set to 1 hour. I have some ideas for making this
faster by batching, and I can increase the time limit per task (but that
has other repercussions). For now, just keep in mind that the theoretical
maximum cohort size you should upload is roughly 18,000 users. I would
love some feedback about whether it's ok to increase the time limit or if
people want me to focus on making validation faster.
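To put that 18,000 figure in context, it is just whatever fits inside the
task time limit at the current validation speed:

  3,600 seconds / 18,000 users ≈ 0.2 seconds of validation per user

so speeding up the per-user lookup raises that ceiling directly, while
raising the time limit only buys a proportionally larger cohort.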
Dan
The forwarded message is very relevant for wikimetrics. I have some basic backups, but we really should migrate the service to production. There are a few things to fix before we do so, but if we don't want any interruptions or data loss, this *has* to happen.
---------- Forwarded message ----------
From: "Andrew Bogott" <abogott(a)wikimedia.org>
Date: Fri, Nov 15, 2013 at 2:29 PM
Subject: [Wikitech-l] Labs datacenter migration
To: "A list for announcements and discussion related to the Wikimedia Labs project." <labs-l(a)lists.wikimedia.org>, "Wikimedia developers" <wikitech-l(a)lists.wikimedia.org>
> Almost a year ago, the Wikimedia Foundation migrated most of our
> services from our old data center in Tampa to the new one in Ashburn
> [1]. In the next couple of months Labs and Tool Labs will be following
> suit -- we expect to have everything moved to Ashburn by mid-January at
> the latest.
> This move will provide some immediate benefits (lower latency with
> production, quicker database replication) and many long-term benefits
> (better stability, happier Operations staff). We don't yet have a
> specific timeline for stages of the migration, but there are a few
> things you can do now to help us prepare for the change and to bolster
> your projects against possible disruption.
> 1) Subscribe to Labs-l, and read it. [2] Labs-l is low-volume, and
> future migration announcements may not be sent to other lists.
> 2) Tool Labs users: As long as your tools are properly managed by the
> grid engine and can survive stops and restarts, the migration will be
> quite painless. If your tools aren't, or can't... fix them :)
> 3) Labs project admins: Clean up old projects and instances. If you
> have instances that are no longer of interest, delete them. If you know
> of entire projects that are no longer in use, please contact me directly
> and I'll mop up.
> 4) Labs instance owners: Make sure that puppet is running properly on
> your instances. If 'sudo puppetd -tv' produces any red lines, then fix
> them or contact me for help with fixing. When instances move to the new
> data center we'll be relying on puppet to update location-specific
> settings, so instances without puppet may not survive the move. If your
> instance uses self-hosting puppet (via puppetmaster::self or
> role::puppet::self) then you will also need to update your local puppet
> repo. [3]
> 5) All Labs users: if you have valuable data residing on local instance
> storage, start backing it up to shared storage in /data/project. You
> should be doing this anyway -- no instance is safe from catastrophe, ever.
> 6) If your project or tool generates log files, have a look at purging
> old log data. The last time we did a data migration there was at least
> one terabyte-sized logfile that really gummed up the works.
> Updates about this change will be posted to this list as soon as we
> know about them. Any potential downtime will be announced well in
> advance. In the meantime, don't hesitate to ask questions about the
> above steps on IRC or the mailing list.
> -Andrew
> [1]
> https://blog.wikimedia.org/2013/01/19/wikimedia-sites-move-to-primary-data-…
> [2] https://lists.wikimedia.org/mailman/listinfo/labs-l
> [3] https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster#FAQ
Hi,
I just noticed someone ran a query from 2012 to 2013 as a timeseries by
hour. This... creates a *lot* of data. For the cohort they used, it's
about 1.8 million pieces of data. Should we cap report sizes somehow? It
doesn't pose any immediate danger other than taking up a lot of resources
and computation time, as well as I/O time spent logging the results (the
log is currently acting as a rudimentary backup; perhaps this is
ill-conceived).
In this case it looks like maybe it was a mistake, so one idea is to warn
the user that they are about to generate a lot of data, and to ask them to
confirm.
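For a rough sense of where a number like 1.8 million comes from, it is just
hours-in-range times cohort size (times the number of metrics). A quick
back-of-the-envelope sketch; the cohort size and warning threshold below
are guesses for illustration, not the actual values from that report:

from datetime import datetime

def estimated_result_count(start, end, cohort_size, metrics=1):
    """Rough number of data points an hourly timeseries report will produce."""
    hours = int((end - start).total_seconds() // 3600)
    return hours * cohort_size * metrics

# A year of hourly data for a cohort of roughly 200 users (a guess) is
# already around 1.75 million points, the same ballpark as the report above.
points = estimated_result_count(datetime(2012, 1, 1), datetime(2013, 1, 1), 200)
print(points)  # 8784 hours * 200 users = 1,756,800

# The "warn and confirm" idea could hang off an arbitrary threshold like this.
WARN_THRESHOLD = 1000000
if points > WARN_THRESHOLD:
    print("This report will generate a lot of data. Continue?")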
Thoughts?
Dan