After meeting with the team:
 - The encoding issue was due to the locale being wrongly set on some machines (we don't yet know why)
 - We will find a way to enforce file.encoding, first looking for a Java-global way and, if that is not feasible, a process-local one.
 - We will NOT spend computing resources on a job trying to detect this issue (too costly relative to its probability of occurrence, particularly if we force file.encoding)
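As a starting point for the enforcement work, here is a minimal sketch of a startup sanity check (the class name is illustrative; it assumes the JVM was launched with -Dfile.encoding=UTF-8, since the default charset is fixed at JVM startup and cannot be changed afterwards):

```java
import java.nio.charset.Charset;

public class EncodingCheck {
    // true iff the JVM default charset is UTF-8. Note: file.encoding is only
    // honored when passed at startup (e.g. -Dfile.encoding=UTF-8); calling
    // System.setProperty later does NOT change the default charset.
    static boolean defaultIsUtf8() {
        return Charset.defaultCharset().name().equalsIgnoreCase("UTF-8");
    }

    public static void main(String[] args) {
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset().name());
        if (!defaultIsUtf8()) {
            System.err.println("WARNING: default charset is not UTF-8; "
                + "relaunch with -Dfile.encoding=UTF-8");
        }
    }
}
```

For the Java-global route on the cluster, my assumption is we would add -Dfile.encoding=UTF-8 to the relevant Java opts settings (e.g. mapreduce.map.java.opts / mapreduce.reduce.java.opts and their equivalents for the other services); to be confirmed when we dig in.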
Cheers
Joseph

On Wed, Mar 2, 2016 at 11:24 AM, Joseph Allemandou <jallemandou@wikimedia.org> wrote:
@Ori: Needs to be discussed with the team - My 2 cents
  • Detection: possible to implement as part of one of the Oozie jobs: we would compute the number of uri_paths that have different page_titles (a high count would mean something is wrong).
  • Prevention: 2 things possible
    • Try to understand WHY this happened (very difficult I think, possibly related to a weird state after the upgrade) and ensure we don't fall into that state again
    • Force JVM file.encoding for every Java process on the cluster (probably easier, but not trivial to make sure nothing is forgotten)
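To make the detection idea concrete, here is a minimal in-memory sketch of the metric (the real job would of course run over the cluster's data; the class name and sample rows below are hypothetical):

```java
import java.util.*;

public class TitleConsistencyCheck {
    // Each row is a hypothetical (uri_path, extracted page_title) pair.
    // Returns the number of uri_paths that map to more than one page_title,
    // which is the signal we would alert on.
    static int countInconsistentPaths(List<String[]> rows) {
        Map<String, Set<String>> titlesByPath = new HashMap<>();
        for (String[] row : rows) {
            titlesByPath.computeIfAbsent(row[0], k -> new HashSet<>()).add(row[1]);
        }
        int inconsistent = 0;
        for (Set<String> titles : titlesByPath.values()) {
            if (titles.size() > 1) inconsistent++; // same path, several titles: suspect
        }
        return inconsistent;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"/wiki/Z%C3%BCrich", "Zürich"},
            new String[]{"/wiki/Z%C3%BCrich", "ZÃ¼rich"}, // mojibake variant
            new String[]{"/wiki/Main_Page", "Main Page"});
        System.out.println(countInconsistentPaths(rows)); // prints 1
    }
}
```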
I'd love to have your thoughts / ideas and discuss them with the team.
Thanks

On Wed, Mar 2, 2016 at 10:53 AM, Ori Livneh <ori@wikimedia.org> wrote:
So: what is the planning for making sure this doesn't happen the next time around? :)

On Tue, Mar 1, 2016 at 5:26 AM, Joseph Allemandou <jallemandou@wikimedia.org> wrote:
Hi,

TL;DR: Please don't use Hive / Spark / Hadoop before next week.

Last week the Analytics Team performed an upgrade of the Hadoop cluster.
It went reasonably well, except that many of the Hadoop processes were launched with an option that made them NOT use UTF-8 as the default encoding.
This issue caused trouble particularly in page title extraction and was detected last Sunday (many kudos to the people who filed encoding bugs against the Analytics API :)
We found the bug and fixed it yesterday, and backfilling starts today, with the cluster recomputing every dataset from 2016-02-23 onward.
This means you shouldn't query last week's data during this week: first because it is incorrect, and second because you'll curse the cluster for being too slow :)
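For the curious, the kind of corruption this causes can be reproduced by decoding UTF-8 bytes with a non-UTF-8 default charset (ISO-8859-1 here as an illustrative stand-in for whatever the misconfigured locale selected):

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // Bytes as they arrive on the wire: correct UTF-8.
        byte[] utf8Bytes = "Zürich".getBytes(StandardCharsets.UTF_8);
        // A JVM whose default charset is not UTF-8 effectively does this
        // when it builds a String without specifying the charset:
        String garbled = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(garbled); // prints ZÃ¼rich
    }
}
```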

We are sorry for the inconvenience.
Don't hesitate to contact us if you have any questions.


--
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal

_______________________________________________
Engineering mailing list
Engineering@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/engineering




