Wikimetrics has been having serious connectivity problems for a few days.
It turned out to be solvable by using some new hostnames
(labsdb1002.eqiad.wmnet). I fixed it just now; please retry your reports
and let me know if anything is still wrong.
On Fri, Jan 23, 2015 at 10:46 AM, Dan Andreescu <dandreescu(a)wikimedia.org>
wrote:
> Hi everyone. I will work on this as soon as I get into the office, in
> about an hour from now. Yuvi suggested one thing that I wasn't aware of
> that might make this a simple fix.
>
>
> On Friday, January 23, 2015, Dan Higgins <dhiggins(a)wikimedia.org> wrote:
>
>> Hi Kevin,
>>
>> Sorry to be a pest but do you have any update on sorting out the
>> Wikimetrics issues? It seems to have gotten worse since we last spoke to
>> you with around 1 in 10 reports going through.
>>
>> Thanks,
>>
>> Dan
>>
>> On Tue, Jan 20, 2015 at 7:17 PM, Kevin Leduc <kevin(a)wikimedia.org> wrote:
>>
>>> All the developers are in transit to SF today. Dan said he'd be in the
>>> office this afternoon. First dev I see I'll notify them of problems in
>>> wikimetrics.
>>>
>>> On Tue, Jan 20, 2015 at 11:10 AM, Amanda Bittaker <
>>> abittaker(a)wikimedia.org> wrote:
>>>
>>>> Hello again gentlemen,
>>>>
>>>> I think Dan might have already pinged you, but just in case, I wanted
>>>> to let you know that we are getting these failures again. It's kind
>>>> of crunch time for getting this data, so we're just banging our heads
>>>> against the wall and retrying the reports until they work (1 out of 4
>>>> times for me.) Is there any way you all could work your magic again?
>>>>
>>>> Many thanks once again,
>>>> Amanda
>>>>
>>>>
>>>>
>>>> On Wed, Dec 10, 2014 at 4:30 PM, Kevin Leduc <kevin(a)wikimedia.org>
>>>> wrote:
>>>> > It's good to hear it's working again. Don't hesitate to reach out to us
>>>> > here or at wikimetrics(a)lists.wikimedia.org if you notice this kind of
>>>> > trouble again.
>>>> >
>>>> > On Wed, Dec 10, 2014 at 3:37 PM, Amanda Bittaker <abittaker(a)wikimedia.org>
>>>> > wrote:
>>>> >>
>>>> >> It's working perfectly now--a thousand thank yous, Dan and Marcel.
>>>> >>
>>>> >> On Wed, Dec 10, 2014 at 3:24 PM, Edward Galvez <egalvez(a)wikimedia.org>
>>>> >> wrote:
>>>> >>>
>>>> >>> Thanks so much Dan and Marcel!
>>>> >>>
>>>> >>> -E
>>>> >>>
>>>> >>>
>>>> >>> On Wed, Dec 10, 2014 at 3:08 PM, Dan Andreescu <dandreescu(a)wikimedia.org>
>>>> >>> wrote:
>>>> >>>>
>>>> >>>> forgot Marcel - my fault. Jaime & folks, in general Marcel rules and
>>>> >>>> he's probably going to help you out faster / better than I can.
>>>> >>>>
>>>> >>>> On Wed, Dec 10, 2014 at 5:57 PM, Dan Andreescu
>>>> >>>> <dandreescu(a)wikimedia.org> wrote:
>>>> >>>>>
>>>> >>>>> Ok, Amanda and anyone else who had problems. Please try again. I
>>>> >>>>> think I've cleared up some gunk and that might have helped things.
>>>> >>>>> We'll be looking at performance more closely soon.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> Steps taken, logging mostly for post-mortem purposes:
>>>> >>>>>
>>>> >>>>> * delete from report where recurrent_parent_id is null and
>>>> >>>>> recurrent = 0 and created < date('2014-12-01');
>>>> >>>>> ** This deleted records that are not visible in the system anymore.
>>>> >>>>> They are recoverable from the wikimetrics database backups but we
>>>> >>>>> don't need them in the database. These probably slowed some things
>>>> >>>>> down; in total the statement deleted 1623628 rows.
>>>> >>>>>
>>>> >>>>> * alter table report add column old_recurrent tinyint(1); update
>>>> >>>>> report set recurrent = 0, old_recurrent = 1 where user_id = 461 and
>>>> >>>>> recurrent = 1;
>>>> >>>>> ** This disables WikimetricsBot recurrent reports, but preserves the
>>>> >>>>> data so we can deal with them later. When labs is done
>>>> >>>>> re-synchronizing, we will be re-running these reports. They feed data
>>>> >>>>> to Vital Signs, in case someone's curious about what they are.
>>>> >>>>>
>>>> >>>>> * Stopped and rebooted the system. The backup system seems to be
>>>> >>>>> hanging or taking a really long time. I'd like to take a look at
>>>> >>>>> this in more depth, but my guess is the amount it's transferring has
>>>> >>>>> gone beyond what we expected.
>>>> >>>>>
>>>> >>>>> On Wed, Dec 10, 2014 at 5:23 PM, Dan Andreescu
>>>> >>>>> <dandreescu(a)wikimedia.org> wrote:
>>>> >>>>>>
>>>> >>>>>> We're sorry - the problems we were facing last week have probably
>>>> >>>>>> festered. I'm going to turn off some things and reset the system.
>>>> >>>>>> I'll report back.
>>>> >>>>>>
>>>> >>>>>> On Wed, Dec 10, 2014 at 4:47 PM, Amanda Bittaker
>>>> >>>>>> <abittaker(a)wikimedia.org> wrote:
>>>> >>>>>>>
>>>> >>>>>>> Oh yes, and Jaime did have me restart my browser and clear the cache,
>>>> >>>>>>> but it did not help.
>>>> >>>>>>>
>>>> >>>>>>> Thanks again,
>>>> >>>>>>> Amanda
>>>> >>>>>>>
>>>> >>>>>>> On Wed, Dec 10, 2014 at 1:45 PM, Amanda Bittaker
>>>> >>>>>>> <abittaker(a)wikimedia.org> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> Hello Kevin,
>>>> >>>>>>>>
>>>> >>>>>>>> Jaime asked me to email you about some trouble I've been having with
>>>> >>>>>>>> Wikimetrics. The whole team has been experiencing a pretty high rate
>>>> >>>>>>>> of failures in both report creation and cohort uploads. Almost
>>>> >>>>>>>> nothing has gotten through for me today: of the last 13 reports I've
>>>> >>>>>>>> run, 3 were successful. Of the failures, I would say maybe only two
>>>> >>>>>>>> or three "pended" at all before becoming failures. I've been
>>>> >>>>>>>> experiencing the same problem with cohort uploads.
>>>> >>>>>>>>
>>>> >>>>>>>> The reports have been: Newly Registered, Edits, and Rolling Active
>>>> >>>>>>>> Editor using expanded cohorts. Please find attached an example of one
>>>> >>>>>>>> of the reports. I tried uploading cohorts using text files of user
>>>> >>>>>>>> names and pasting user names from Notepad into the "Paste Usernames"
>>>> >>>>>>>> field. I do expand the cohorts every time.
>>>> >>>>>>>>
>>>> >>>>>>>> Do you know why the failure rate is so high, especially this
>>>> >>>>>>>> morning, and is there a way to eliminate or mitigate this problem in
>>>> >>>>>>>> the future?
>>>> >>>>>>>>
>>>> >>>>>>>> Many thanks for the assistance, and please do let me know if you
>>>> >>>>>>>> need any more information from me on this.
>>>> >>>>>>>>
>>>> >>>>>>>> Best,
>>>> >>>>>>>> Amanda
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>
>>>> >>>>>
>>>> >>>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> Edward Galvez
>>>> >>> Program Evaluation Associate
>>>> >>> Wikimedia Foundation
>>>> >>
>>>> >>
>>>> >
>>>>
>>>
>>>
>>