Re: [Analytics] [Engineering] Hadoop - Last week data needs to be backfilled

2 Mar 2016

After meeting with the team:
 - The encoding issue was caused by the locale being set incorrectly on some
machines (we don't yet know why).
 - We will find a way to enforce file.encoding, looking first for a
Java-global way and, if that is not feasible, a process-local one.
 - We will NOT spend computing resources on a job that tries to detect this
issue (too costly given how unlikely a recurrence is, particularly if we force
file.encoding).
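
For illustration only, here is a minimal sketch of one possible way to enforce file.encoding from a launcher script. It assumes the JVMs are started from a wrapper that controls their environment, and it relies on JAVA_TOOL_OPTIONS, an environment variable that HotSpot JVMs read at startup. The helper name with_forced_encoding is hypothetical, and nothing in this thread says the team chose this approach.

```python
import os

FORCE_UTF8 = "-Dfile.encoding=UTF-8"


def with_forced_encoding(env):
    """Return a copy of env whose JAVA_TOOL_OPTIONS forces UTF-8.

    Hypothetical helper: assumes the options contain no quoted
    values with embedded spaces, since it splits on whitespace.
    """
    new_env = dict(env)
    # Keep options that are already set, dropping any older file.encoding flag.
    kept = [
        opt
        for opt in new_env.get("JAVA_TOOL_OPTIONS", "").split()
        if not opt.startswith("-Dfile.encoding=")
    ]
    new_env["JAVA_TOOL_OPTIONS"] = " ".join(kept + [FORCE_UTF8])
    return new_env


# Environment that a wrapper could pass as env= to subprocess.run
# when it starts a Java process.
launch_env = with_forced_encoding(os.environ)
```

Whether this counts as Java-global or process-local depends on where the variable is set: machine-wide in a profile script, or only in the environment of a single launched process.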
Cheers
Joseph
On Wed, Mar 2, 2016 at 11:24 AM, Joseph Allemandou <
jallemandou@wikimedia.org> wrote:
...
@Ori: Needs to be discussed with the team - My 2 cents

 - Detection: possible to implement as part of one of the Oozie jobs. We
will compute the number of pages having different page_title values for the
same uri_path (if high, not good).
 - Prevention: 2 things possible:
    - Try to understand WHY this happened (very difficult I think, possibly
related to a weird state after the upgrade) and ensure we don't fall into
that state again.
    - Force JVM file.encoding for every Java process of the cluster (probably
easier, but it is not easy to be sure nothing is forgotten).
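
The detection idea above (counting uri_path values that map to more than one page_title) could be sketched roughly as follows. This is plain Python over in-memory rows rather than an actual Oozie job. The function name and the sample rows are made up for illustration, and the thread does not say what threshold would count as "high".

```python
from collections import defaultdict


def paths_with_conflicting_titles(rows):
    """Return the uri_path values seen with more than one distinct page_title."""
    titles_by_path = defaultdict(set)
    for uri_path, page_title in rows:
        titles_by_path[uri_path].add(page_title)
    return sorted(
        path for path, titles in titles_by_path.items() if len(titles) > 1
    )


# Made-up sample rows: (uri_path, page_title)
sample = [
    ("/wiki/Caf%C3%A9", "Café"),
    ("/wiki/Caf%C3%A9", "Caf\ufffd\ufffd"),  # same path, garbled title
    ("/wiki/Paris", "Paris"),
    ("/wiki/Paris", "Paris"),
]
conflicts = paths_with_conflicting_titles(sample)
print(len(conflicts), conflicts)
```

A real job would compare the resulting count against some agreed threshold and raise an alert when it is exceeded.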

I'd love to have your thoughts / ideas and discuss them with the team.
Thanks
On Wed, Mar 2, 2016 at 10:53 AM, Ori Livneh <ori@wikimedia.org> wrote:
...
So: what is the plan for making sure this doesn't happen the next
time around? :)
On Tue, Mar 1, 2016 at 5:26 AM, Joseph Allemandou <
jallemandou@wikimedia.org> wrote:
...
Hi,
*TL;DR: Please don't use Hive / Spark / Hadoop before next week.*
Last week the Analytics Team performed an upgrade of the Hadoop cluster.
It went reasonably well, except that many of the Hadoop processes were
launched with a special option to NOT use UTF-8 as the default encoding.
This issue caused trouble particularly in page title extraction and was
detected last Sunday (many kudos to the people who filed bugs on the
Analytics API about encoding :)
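
To show the kind of damage a wrong default encoding does to a page title, here is a small sketch. It assumes, purely for illustration, that the wrong default was ISO-8859-1 or US-ASCII; this thread does not say which encoding was actually in effect, and the title "Café" is a made-up example.

```python
title = "Café"
utf8_bytes = title.encode("utf-8")

# Decoding with the right charset round-trips the title.
ok = utf8_bytes.decode("utf-8")

# Decoding the same bytes as ISO-8859-1 turns the two bytes of the
# accented character into two unrelated characters.
as_latin1 = utf8_bytes.decode("iso-8859-1")

# Decoding as US-ASCII with replacement loses the character entirely:
# each non-ASCII byte becomes U+FFFD.
as_ascii = utf8_bytes.decode("ascii", errors="replace")

print(ok, as_latin1, as_ascii)
```

Titles made only of ASCII characters come out identical in all three cases, which is why such a problem can go unnoticed until non-ASCII titles are inspected.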
We found the bug and fixed it yesterday, and backfill starts today, with
the cluster recomputing every dataset starting 2016-02-23 onward.
This means you shouldn't query last week's data during this week, first
because it is incorrect, and second because you'll curse the cluster for
being too slow :)
We are sorry for the inconvenience.
Don't hesitate to contact us if you have any questions.
--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal

Engineering mailing list
Engineering@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/engineering
