Hi everyone,
A quick note about something that just messed me up. When uploading a
cohort to wikimetrics, you are told you can use user_name, user_id, or a
mixture of the two in the first column. However, this can really produce
unexpected results if you don't know how it works. I think it needs to
change, but until then, this is how it works and how it can bite you:
Let's say I have a list of users:
1,en
2,en
3,en
When it validates, it will first look up user_name == 1; if it doesn't find
anything, it will look up user_id == 1. Then user_name == 2, user_id == 2,
user_name == 3, user_id == 3. If what you meant with the above cohort was
the users with ids 1, 2, and 3, then you might be very confused later when
you see user id 234215 in your output results. This can happen if some
user's name is literally "2"! So, for now, until I figure out how to fix
this, validation will always prefer user_name matches over user_id matches.
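To make that concrete, here is a rough sketch of the lookup order, using
in-memory stand-ins instead of the real MediaWiki tables. Every name, id,
and function below is made up for illustration; it is not the actual
wikimetrics code:

# Stand-ins for the real user tables; all names and ids here are made up.
ids_by_name = {"2": 234215, "SomeEditor": 1001}      # user_name -> user_id
names_by_id = {1: "Admin", 2: "OtherEditor", 3: "ThirdEditor"}

def resolve(raw_value):
    """Return the user_id a raw CSV value resolves to, preferring user_name."""
    # 1) Treat the value as a user_name first.
    if raw_value in ids_by_name:
        return ids_by_name[raw_value]
    # 2) Only if no such name exists, fall back to treating it as a user_id.
    try:
        user_id = int(raw_value)
    except ValueError:
        return None
    return user_id if user_id in names_by_id else None

for raw in ["1", "2", "3"]:
    print(raw, "->", resolve(raw))
# "1" and "3" fall through to user ids 1 and 3 as intended, but "2" resolves
# to user id 234215 because a user literally named "2" exists.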
Please let me know if this is confusing. Also, the whole problem stems
from needing to accept both user_id and user_name in the *same* upload. If
everyone agrees, I'd much rather just allow people to toggle between one or
the other. This would speed up validation and make it much clearer what is
going on.
Dear Wikimetrics users,
I've just deployed asynchronous cohort upload. This is feature #818
(https://mingle.corp.wikimedia.org/projects/analytics/cards/818), and it
basically allows you to upload larger cohorts because validation happens
behind the scenes. I'll go over how the new functionality works
here, and will rely on one of you to point me to the appropriate on-wiki
place to update documentation.
So basically, visiting /cohorts and clicking "Upload Cohort" works as
before. But once you click "Upload CSV", your form is validated,
processed, and you're taken back to the cohorts page. Your new cohort is
immediately created but is not yet validated. While it validates, you'll
see the validation status and have a few options:
* Remove Cohort. This is destructive and will remove this cohort from your
list. Use this in case you made a mistake, uploaded the wrong file, etc.
* Validate Again. This will run validation again. One possible use: say
you upload a cohort with some *very* newly registered users and, because
of replication lag to the labsdb databases, most of them come up invalid.
Once replication catches up, you can run validation again.
* Refresh. This just refreshes the status of the validation and will
update the counts that show up below.
You will not have the "Create Report" option until validation is done. And
when you do create a report, only valid users will be considered and used
in the output.
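For the curious, the validation itself just runs as a background task with
a time limit (more on that limit below). Here is a minimal sketch of that
shape, assuming a Celery-style task; the names, broker URL, and helper
functions are made up and this is not the actual wikimetrics code:

from celery import Celery

# Hypothetical app name and broker URL, purely for illustration.
queue = Celery('validate_sketch', broker='redis://localhost:6379/0')

@queue.task(time_limit=3600)  # mirrors the 1 hour limit mentioned below
def validate_cohort(cohort_id):
    """Resolve each uploaded row in the background and store its status;
    the web UI just polls the stored valid/invalid counts."""
    valid = invalid = 0
    for raw_value in load_cohort_rows(cohort_id):
        if resolve(raw_value) is None:
            invalid += 1
        else:
            valid += 1
    store_validation_counts(cohort_id, valid, invalid)

def load_cohort_rows(cohort_id):
    # Stand-in for reading the uploaded CSV rows back out of the database.
    return ["1", "2", "3"]

def resolve(raw_value):
    # Stand-in for the user_name-then-user_id lookup from the previous email.
    return raw_value

def store_validation_counts(cohort_id, valid, invalid):
    # Stand-in for persisting the counts that "Refresh" re-reads.
    print(cohort_id, valid, invalid)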
One caveat. Validation is still slow. And the time limit for the
asynchronous task is set to 1 hour. I have some ideas for making this
faster by batching, and I can increase the time limit per task (but that
has other repercussions). For now, just keep in mind that the theoretical
maximum cohort size you should upload is roughly 18,000 users. I would
love some feedback about whether it's ok to increase the time limit or if
people want me to focus on making validation faster.
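To put that 18,000 figure in context, it is just whatever fits inside the
task time limit at the current validation speed:

  3,600 seconds / 18,000 users ≈ 0.2 seconds of validation per user

so speeding up the per-user lookup raises that ceiling directly, while
raising the time limit only buys a proportionally larger cohort.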
Dan
The forwarded message is very relevant for wikimetrics. I have some basic backups, but we really should migrate the service to production. There are a few things to fix before we do so, but if we don't want any interruptions or data loss, this *has* to happen.
---------- Forwarded message ----------
From: "Andrew Bogott" <abogott(a)wikimedia.org>
Date: Fri, Nov 15, 2013 at 2:29 PM
Subject: [Wikitech-l] Labs datacenter migration
To: "A list for announcements and discussion related to the Wikimedia Labs project." <labs-l(a)lists.wikimedia.org>, "Wikimedia developers" <wikitech-l(a)lists.wikimedia.org>
> Almost a year ago, the Wikimedia Foundation migrated most of our
> services from our old data center in Tampa to the new one in Ashburn
> [1]. In the next couple of months Labs and Tool Labs will be following
> suit -- we expect to have everything moved to Ashburn by mid-January at
> the latest.
> This move will provide some immediate benefits (lower latency with
> production, quicker database replication) and many long-term benefits
> (better stability, happier Operations staff). We don't yet have a
> specific timeline for stages of the migration, but there are a few
> things you can do now to help us prepare for the change and to bolster
> your projects against possible disruption.
> 1) Subscribe to Labs-l, and read it. [2] Labs-l is low-volume, and
> future migration announcements may not be sent to other lists.
> 2) Tool Labs users: As long as your tools are properly managed by the
> grid engine and can survive stops and restarts, the migration will be
> quite painless. If your tools aren't, or can't... fix them :)
> 3) Labs project admins: Clean up old projects and instances. If you
> have instances that are no longer of interest, delete them. If you know
> of entire projects that are no longer in use, please contact me directly
> and I'll mop up.
> 4) Labs instance owners: Make sure that puppet is running properly on
> your instances. If 'sudo puppetd -tv' produces any red lines, then fix
> them or contact me for help with fixing. When instances move to the new
> data center we'll be relying on puppet to update location-specific
> settings, so instances without puppet may not survive the move. If your
> instance uses self-hosting puppet (via puppetmaster::self or
> role::puppet::self) then you will also need to update your local puppet
> repo. [3]
> 5) All Labs users: if you have valuable data residing on local instance
> storage, start backing it up to shared storage in /data/project. You
> should be doing this anyway -- no instance is safe from catastrophe, ever.
> 6) If your project or tool generates log files, have a look at purging
> old log data. The last time we did a data migration there was at least
> one terabyte-sized logfile that really gummed up the works.
> Updates about this change will be posted to this list as soon as we
> know about them. Any potential downtime will be announced well in
> advance. In the meantime, don't hesitate to ask questions about the
> above steps on IRC or the mailing list.
> -Andrew
> [1]
> https://blog.wikimedia.org/2013/01/19/wikimedia-sites-move-to-primary-data-…
> [2] https://lists.wikimedia.org/mailman/listinfo/labs-l
> [3] https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster#FAQ
Hi,
I just noticed someone ran a query from 2012 to 2013 as a timeseries by
hour. This... creates a *lot* of data. For the cohort they used, it's
about 1.8 million pieces of data. Should we cap report sizes somehow? It
doesn't pose any immediate danger other than taking up a lot of resources
and computation time, as well as I/O time spent logging the results (the
log is currently acting as a rudimentary backup; perhaps this is
ill-conceived).
In this case it looks like maybe it was a mistake, so one idea is to warn
the user that they are about to generate a lot of data, and to ask them to
confirm.
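For a rough sense of where a number like 1.8 million comes from, it is just
hours-in-range times cohort size (times the number of metrics). A quick
back-of-the-envelope sketch; the cohort size and warning threshold below
are guesses for illustration, not the actual values from that report:

from datetime import datetime

def estimated_result_count(start, end, cohort_size, metrics=1):
    """Rough number of data points an hourly timeseries report will produce."""
    hours = int((end - start).total_seconds() // 3600)
    return hours * cohort_size * metrics

# A year of hourly data for a cohort of roughly 200 users (a guess) is
# already around 1.75 million points, the same ballpark as the report above.
points = estimated_result_count(datetime(2012, 1, 1), datetime(2013, 1, 1), 200)
print(points)  # 8784 hours * 200 users = 1,756,800

# The "warn and confirm" idea could hang off an arbitrary threshold like this.
WARN_THRESHOLD = 1000000
if points > WARN_THRESHOLD:
    print("This report will generate a lot of data. Continue?")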
Thoughts?
Dan