Re: [Analytics] [Engineering] Hadoop - Last week data needs to be backfilled

2 Mar 2016


Hi!
The pagecounts files are regenerated, but they shouldn't be affected by the
encoding issue since page_title is not decoded :)
The files I expect to have changed are the new version of pageviews:
http://dumps.wikimedia.org/other/pageviews/2016/2016-02/
Joseph
On Tue, Mar 1, 2016 at 9:52 PM, Bo Han bo.ning.han@gmail.com wrote:
Thanks!
Diffing the newly-uploaded files for 20160223-160000 and
20160223-170000 with the previously-uploaded ones shows that their
contents are the same. Were the original pagecounts files at
http://dumps.wikimedia.org/other/pagecounts-all-sites/2016/2016-02/
not corrupted? The backfill is referring to other data, I assume?
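A minimal sketch of the kind of comparison described above, assuming the hourly files are gzip-compressed and have already been downloaded locally. The function names are hypothetical. Comparing decompressed contents rather than raw bytes avoids false differences from the gzip header, which stores a modification time.

```python
import gzip
import hashlib


def content_digest(path: str) -> str:
    """Return the SHA-256 of the decompressed contents of a gzip file."""
    digest = hashlib.sha256()
    with gzip.open(path, "rb") as handle:
        # Read in 1 MiB chunks so large hourly files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(old_path: str, new_path: str) -> bool:
    """True if both gzip files decompress to identical bytes."""
    return content_digest(old_path) == content_digest(new_path)
```

Calling `same_contents` on the old and new copy of one hourly file returns `True` when the regenerated data is unchanged, even if the compressed files differ byte for byte.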
Bo
On Tue, Mar 1, 2016 at 11:26 AM, Andrew Otto otto@wikimedia.org wrote:
https://phabricator.wikimedia.org/T128295
On Tue, Mar 1, 2016 at 2:15 PM, Bo Han bo.ning.han@gmail.com wrote:
Hi,
Would you mind linking the bug fix here? I couldn't find it on
phabricator.
Thanks,
Bo
On Tue, Mar 1, 2016 at 7:24 AM, Joseph Allemandou
jallemandou@wikimedia.org wrote:
Hey Oliver,
It depends on what data you've used: if page_title or other 'encoding
sensitive' data (I can't think of any other, but ...) is part of it,
then yes, you should!
On Tue, Mar 1, 2016 at 3:27 PM, Oliver Keyes okeyes@wikimedia.org
wrote:
Hey Joseph,
Thanks for letting us know. So we should delete and backfill last
week's data, for our regularly scheduled scripts?
On 1 March 2016 at 08:26, Joseph Allemandou <jallemandou@wikimedia.org> wrote:
Hi,
TL;DR: Please don't use Hive / Spark / Hadoop before next week.
Last week the Analytics Team performed an upgrade of the Hadoop cluster.
It went reasonably well, except that many of the Hadoop processes were
launched with a special option to NOT use UTF-8 as the default encoding.
This issue caused trouble particularly in page title extraction and was
detected last Sunday (many kudos to the people who filed bugs on the
Analytics API about encoding :)
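As an illustration only, here is a small sketch of how a non-UTF-8 default encoding can garble a page title. The title, the helper name, and the two fallback encodings (Latin-1 and ASCII) are assumptions; the thread does not say which encoding the processes actually used.

```python
# Hypothetical page title; stored as UTF-8 bytes (assumed).
title = "Café_au_lait"
raw = title.encode("utf-8")


def decode_title(raw_bytes: bytes, encoding: str) -> str:
    """Decode raw bytes, substituting U+FFFD for bytes the charset rejects."""
    return raw_bytes.decode(encoding, errors="replace")


good = decode_title(raw, "utf-8")      # round-trips cleanly
latin1 = decode_title(raw, "latin-1")  # each UTF-8 byte becomes its own character
ascii_ = decode_title(raw, "ascii")    # non-ASCII bytes become replacement characters

print(good)    # Café_au_lait
print(latin1)  # CafÃ©_au_lait
print(ascii_)  # Caf??_au_lait, with U+FFFD where the question marks are
```

Either way the decoded title no longer matches the real one, which is why datasets that include page_title are the ones affected.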
We found the bug and fixed it yesterday, and the backfill starts today, with
the cluster recomputing every dataset from 2016-02-23 onward.
This means you shouldn't query last week's data during this week, first
because it is incorrect, and second because you'll curse the cluster for
being too slow :)
We are sorry for the inconvenience.
Don't hesitate to contact us if you have any questions.
--
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal

Engineering mailing list
Engineering@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/engineering
--
Oliver Keyes
Count Logula
Wikimedia Foundation

Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics

-- 
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
