Thanks Joseph. Am I correct in saying that the counts in pageviews are just the aggregated counts for decoded page titles from pagecounts-all-sites?
Bo
On Tue, Mar 1, 2016 at 1:39 PM, Joseph Allemandou jallemandou@wikimedia.org wrote:
Hi! pagecounts are regenerated but shouldn't be impacted by the encoding issue, since page_title is not decoded :) The files I expect to have changed are the new version of pageviews: http://dumps.wikimedia.org/other/pageviews/2016/2016-02/ Joseph
On Tue, Mar 1, 2016 at 9:52 PM, Bo Han bo.ning.han@gmail.com wrote:
Thanks!
Diffing the newly-uploaded files for 20160223-160000 and 20160223-170000 against the previously-uploaded ones shows that their contents are identical. Were the original pagecounts files at http://dumps.wikimedia.org/other/pagecounts-all-sites/2016/2016-02/ not corrupted, then? I assume the backfill refers to other data?
Bo
On Tue, Mar 1, 2016 at 11:26 AM, Andrew Otto otto@wikimedia.org wrote:
https://phabricator.wikimedia.org/T128295
On Tue, Mar 1, 2016 at 2:15 PM, Bo Han bo.ning.han@gmail.com wrote:
Hi,
Would you mind linking the bug fix here? I couldn't find it on phabricator.
Thanks, Bo
On Tue, Mar 1, 2016 at 7:24 AM, Joseph Allemandou jallemandou@wikimedia.org wrote:
Hey Oliver, It depends on what data you've used: if page_title or other encoding-sensitive data (I can't think of any other, but ...) is part of it, then yes, you should!
On Tue, Mar 1, 2016 at 3:27 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Hey Joseph,
Thanks for letting us know. So we should delete and backfill last week's data, for our regularly scheduled scripts?
On 1 March 2016 at 08:26, Joseph Allemandou jallemandou@wikimedia.org wrote:
Hi,
TL;DR: Please don't use hive / spark / hadoop before next week.
Last week the Analytics Team performed an upgrade to the Hadoop Cluster. It went reasonably well, except that many of the Hadoop processes were launched with a special option to NOT use utf-8 as the default encoding. This issue caused trouble particularly in page title extraction and was detected last Sunday (many kudos to the people having filed bugs on the Analytics API about encoding :)
We found the bug and fixed it yesterday, and backfill starts today, with the cluster recomputing every dataset from 2016-02-23 onward. This means you shouldn't query last week's data during this week: first because it is incorrect, and second because you'll curse the cluster for being too slow :)
We are sorry for the inconvenience. Don't hesitate to contact us if you have any questions.
-- Joseph Allemandou Data Engineer @ Wikimedia Foundation IRC: joal
Engineering mailing list Engineering@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/engineering
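The failure mode Joseph describes can be reproduced in miniature: percent-decoding a page title yields the correct bytes, but interpreting those bytes with a non-UTF-8 default charset garbles any non-ASCII title. A minimal Python sketch of that effect (the title "Zürich" and the Latin-1 fallback are illustrative assumptions, not details from the thread):

```python
# Sketch of the encoding bug: the percent-decoded bytes are fine, but
# decoding them with a non-UTF-8 default charset produces mojibake.
from urllib.parse import unquote_to_bytes

encoded_title = "Z%C3%BCrich"          # percent-encoded UTF-8 for "Zürich"
raw = unquote_to_bytes(encoded_title)  # b"Z\xc3\xbcrich" -- byte-level correct

correct = raw.decode("utf-8")    # "Zürich"
garbled = raw.decode("latin-1")  # "ZÃ¼rich" -- what a wrong default yields

print(correct, garbled)
```

The same title round-trips or garbles purely depending on which charset the process uses as its default, which is why only encoding-sensitive fields like page_title were affected.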
-- Oliver Keyes Count Logula Wikimedia Foundation
-- Joseph Allemandou Data Engineer @ Wikimedia Foundation IRC: joal
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics