I know it might take a while, but we're re-loading the data in Cassandra too, right?

From: Joseph Allemandou
Sent: Tuesday, March 1, 2016 16:39
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Reply To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] [Engineering] Hadoop - Last week data needs to be backfilled

Hi!
pagecounts are regenerated, but they shouldn't be impacted by the encoding issue since page_title is not decoded :)
The files I expect to have changed are the new version of pageviews: http://dumps.wikimedia.org/other/pageviews/2016/2016-02/
Joseph

On Tue, Mar 1, 2016 at 9:52 PM, Bo Han <bo.ning.han@gmail.com> wrote:
Thanks!

Diffing the newly-uploaded files for 20160223-160000 and
20160223-170000 with the previously-uploaded ones shows that their
contents are the same. Were the original pagecounts files at
http://dumps.wikimedia.org/other/pagecounts-all-sites/2016/2016-02/
not corrupted? The backfill is referring to other data, I assume?

Bo

On Tue, Mar 1, 2016 at 11:26 AM, Andrew Otto <otto@wikimedia.org> wrote:
> https://phabricator.wikimedia.org/T128295
>
> On Tue, Mar 1, 2016 at 2:15 PM, Bo Han <bo.ning.han@gmail.com> wrote:
>>
>> Hi,
>>
>> Would you mind linking the bug fix here? I couldn't find it on
>> phabricator.
>>
>> Thanks,
>> Bo
>>
>> On Tue, Mar 1, 2016 at 7:24 AM, Joseph Allemandou
>> <jallemandou@wikimedia.org> wrote:
>> > Hey Oliver,
>> > It depends on what data you've used: if page_title or other 'encoding
>> > sensitive' data (I can't think of any other, but ...) is part of it,
>> > then yes, you should!
>> >
>> > On Tue, Mar 1, 2016 at 3:27 PM, Oliver Keyes <okeyes@wikimedia.org>
>> > wrote:
>> >>
>> >> Hey Joseph,
>> >>
>> >> Thanks for letting us know. So we should delete and backfill last
>> >> week's data, for our regularly scheduled scripts?
>> >>
>> >> On 1 March 2016 at 08:26, Joseph Allemandou <jallemandou@wikimedia.org>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > TL;DR: Please don't use Hive / Spark / Hadoop before next week.
>> >> >
>> >> > Last week the Analytics Team performed an upgrade to the Hadoop
>> >> > cluster. It went reasonably well, except that many of the Hadoop
>> >> > processes were launched with a special option to NOT use UTF-8 as
>> >> > the default encoding. This issue caused trouble particularly in
>> >> > page title extraction and was detected last Sunday (many kudos to
>> >> > the people who filed bugs on the Analytics API about encoding :)
>> >> > We found the bug and fixed it yesterday, and the backfill starts
>> >> > today, with the cluster recomputing every dataset from 2016-02-23
>> >> > onward.
>> >> > This means you shouldn't query last week's data during this week,
>> >> > first because it is incorrect, and second because you'll curse the
>> >> > cluster for being too slow :)
>> >> >
>> >> > We are sorry for the inconvenience.
>> >> > Don't hesitate to contact us if you have any questions.
>> >> >
>> >> >
>> >> > --
>> >> > Joseph Allemandou
>> >> > Data Engineer @ Wikimedia Foundation
>> >> > IRC: joal
>> >> >
>> >> > _______________________________________________
>> >> > Engineering mailing list
>> >> > Engineering@lists.wikimedia.org
>> >> > https://lists.wikimedia.org/mailman/listinfo/engineering
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Oliver Keyes
>> >> Count Logula
>> >> Wikimedia Foundation
>> >
>> >
>> > _______________________________________________
>> > Analytics mailing list
>> > Analytics@lists.wikimedia.org
>> > https://lists.wikimedia.org/mailman/listinfo/analytics
>> >
>
