Wikimetrics has been having serious connectivity problems for a few days. It turned out to be solvable by using some new hostnames (labsdb1002.eqiad.wmnet). I fixed it just now; please retry your reports and let me know if anything is still wrong.
On Fri, Jan 23, 2015 at 10:46 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
Hi everyone. I will work on this as soon as I get into the office, in about an hour from now. Yuvi suggested one thing that I wasn't aware of that might make this a simple fix.
On Friday, January 23, 2015, Dan Higgins dhiggins@wikimedia.org wrote:
Hi Kevin,
Sorry to be a pest, but do you have any update on sorting out the Wikimetrics issues? It seems to have gotten worse since we last spoke to you, with only around 1 in 10 reports going through.
Thanks,
Dan
On Tue, Jan 20, 2015 at 7:17 PM, Kevin Leduc kevin@wikimedia.org wrote:
All the developers are in transit to SF today. Dan said he'd be in the office this afternoon. I'll notify the first dev I see of the problems in Wikimetrics.
On Tue, Jan 20, 2015 at 11:10 AM, Amanda Bittaker abittaker@wikimedia.org wrote:
Hello again gentlemen,
I think Dan might have already pinged you, but just in case, I wanted to let you know that we are getting these failures again. It's kind of crunch time for getting this data, so we're just banging our heads against the wall and retrying the reports until they work (1 out of 4 times for me). Is there any way you all could work your magic again?
Many thanks once again, Amanda
On Wed, Dec 10, 2014 at 4:30 PM, Kevin Leduc kevin@wikimedia.org wrote:
It's good to hear it's working again. Don't hesitate to reach out to us here or at wikimetrics@lists.wikimedia.org if you notice this kind of trouble again.
On Wed, Dec 10, 2014 at 3:37 PM, Amanda Bittaker abittaker@wikimedia.org wrote:
It's working perfectly now--a thousand thank yous, Dan and Marcel.
On Wed, Dec 10, 2014 at 3:24 PM, Edward Galvez egalvez@wikimedia.org wrote:
Thanks so much Dan and Marcel!
-E
--
Edward Galvez
Program Evaluation Associate
Wikimedia Foundation
On Wed, Dec 10, 2014 at 3:08 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
forgot Marcel - my fault. Jaime & folks, in general Marcel rules and he's probably going to help you out faster / better than I can.
On Wed, Dec 10, 2014 at 5:57 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
Ok, Amanda and anyone else who had problems. Please try again. I think I've cleared up some gunk and that might have helped things. We'll be looking at performance more closely soon.
Steps taken, logging mostly for post-mortem purpose
* delete from report where recurrent_parent_id is null and recurrent = 0 and created < date('2014-12-01');
** This deleted records that are not visible in the system anymore. They are recoverable from the wikimetrics database backups but we don't need them in the database. These probably slowed some things down, in total the statement deleted 1623628 rows.
* alter table report add column old_recurrent tinyint(1); update report set recurrent = 0, old_recurrent = 1 where user_id = 461 and recurrent = 1;
** This disables WikimetricsBot recurrent reports, but preserves the data so we can deal with them later. When labs is done re-synchronizing, we will be re-running these reports. They feed data to Vital Signs, in case someone's curious about what they are.
* Stopped and rebooted the system. The backup system seems to be hanging or taking a really long time. I'd like to take a look at this in more depth, but my guess is the amount it's transferring has gone beyond what we expected.
On Wed, Dec 10, 2014 at 5:23 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
We're sorry - the problems we were facing last week have probably festered. I'm going to turn off some things and reset the system. I'll report back.
On Wed, Dec 10, 2014 at 4:47 PM, Amanda Bittaker abittaker@wikimedia.org wrote:
Oh yes, and Jaime did have me restart my browser and clear the cache, but it did not help.
Thanks again,
Amanda
On Wed, Dec 10, 2014 at 1:45 PM, Amanda Bittaker abittaker@wikimedia.org wrote:
Hello Kevin,
Jaime asked me to email you about some trouble I've been having with Wikimetrics. The whole team has been experiencing a pretty high rate of failures in both report creation and cohort uploads. Almost nothing has gotten through for me today: of the last 13 reports I've run, 3 were successful. Of the failures, I would say maybe only two or three "pended" at all before becoming failures. I've been experiencing the same problem with cohort uploads.
The reports have been: Newly Registered, Edits, and Rolling Active Editor using expanded cohorts. Please find attached an example of one of the reports. I tried uploading cohorts using text files of user names and pasting user names from Notepad into the "Paste Usernames" field. I do expand the cohorts every time.
Do you know why the failure rate is so high, especially this morning, and is there a way to eliminate or mitigate this problem in the future?
Many thanks for the assistance, and please do let me know if you need any more information from me on this.
Best,
Amanda