After meeting with the team:
 - The encoding issue was due to the locale being wrongly set on some machines (we don't yet know why)
 - We will find a way to enforce file.encoding, first looking for a Java-global way and, if that is not feasible, a process-local one.
 - We will NOT spend computing resources on a job trying to detect this issue (too costly relative to its probability of occurrence, particularly if we force file.encoding)
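As a starting point for the enforcement work, here is a minimal sketch of a startup sanity check (the class name is illustrative; it assumes the JVM was launched with -Dfile.encoding=UTF-8, since the default charset is fixed at JVM startup and cannot be changed afterwards):

```java
import java.nio.charset.Charset;

public class EncodingCheck {
    // true iff the JVM default charset is UTF-8. Note: file.encoding is only
    // honored when passed at startup (e.g. -Dfile.encoding=UTF-8); calling
    // System.setProperty later does NOT change the default charset.
    static boolean defaultIsUtf8() {
        return Charset.defaultCharset().name().equalsIgnoreCase("UTF-8");
    }

    public static void main(String[] args) {
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset().name());
        if (!defaultIsUtf8()) {
            System.err.println("WARNING: default charset is not UTF-8; "
                + "relaunch with -Dfile.encoding=UTF-8");
        }
    }
}
```

For the Java-global route on the cluster, my assumption is we would add -Dfile.encoding=UTF-8 to the relevant Java opts settings (e.g. mapreduce.map.java.opts / mapreduce.reduce.java.opts and their equivalents for the other services); to be confirmed when we dig in.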
Cheers
Joseph

On Wed, Mar 2, 2016 at 11:24 AM, Joseph Allemandou <jallemandou@wikimedia.org> wrote:
@Ori: Needs to be discussed with the team - My 2 cents
  • Detection: possible to implement as part of one of the Oozie jobs: we would compute the number of uri_paths that have different page_titles (a high count would mean something is wrong).
  • Prevention: 2 things possible
    • Try to understand WHY this happened (very difficult I think, possibly related to a weird state after the upgrade) and ensure we don't fall into that state again
    • Force JVM file.encoding for every Java process on the cluster (probably easier, but not trivial to make sure nothing is forgotten)
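To make the detection idea concrete, here is a minimal in-memory sketch of the metric (the real job would of course run over the cluster's data; the class name and sample rows below are hypothetical):

```java
import java.util.*;

public class TitleConsistencyCheck {
    // Each row is a hypothetical (uri_path, extracted page_title) pair.
    // Returns the number of uri_paths that map to more than one page_title,
    // which is the signal we would alert on.
    static int countInconsistentPaths(List<String[]> rows) {
        Map<String, Set<String>> titlesByPath = new HashMap<>();
        for (String[] row : rows) {
            titlesByPath.computeIfAbsent(row[0], k -> new HashSet<>()).add(row[1]);
        }
        int inconsistent = 0;
        for (Set<String> titles : titlesByPath.values()) {
            if (titles.size() > 1) inconsistent++; // same path, several titles: suspect
        }
        return inconsistent;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"/wiki/Z%C3%BCrich", "Zürich"},
            new String[]{"/wiki/Z%C3%BCrich", "ZÃ¼rich"}, // mojibake variant
            new String[]{"/wiki/Main_Page", "Main Page"});
        System.out.println(countInconsistentPaths(rows)); // prints 1
    }
}
```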
I'd love to have your thoughts / ideas and discuss them with the team.
Thanks

On Wed, Mar 2, 2016 at 10:53 AM, Ori Livneh <ori@wikimedia.org> wrote:
So: what is the planning for making sure this doesn't happen the next time around? :)

On Tue, Mar 1, 2016 at 5:26 AM, Joseph Allemandou <jallemandou@wikimedia.org> wrote:
Hi,

TL;DR: Please don't use Hive / Spark / Hadoop before next week.

Last week the Analytics Team performed an upgrade of the Hadoop cluster.
It went reasonably well, except that many of the Hadoop processes were launched with an option that made them NOT use UTF-8 as the default encoding.
This issue caused trouble particularly in page title extraction and was detected last Sunday (many kudos to the people who filed encoding bugs against the Analytics API :)
We found the bug and fixed it yesterday, and backfilling starts today, with the cluster recomputing every dataset from 2016-02-23 onward.
This means you shouldn't query last week's data during this week: first because it is incorrect, and second because you'll curse the cluster for being too slow :)
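For the curious, the kind of corruption this causes can be reproduced by decoding UTF-8 bytes with a non-UTF-8 default charset (ISO-8859-1 here as an illustrative stand-in for whatever the misconfigured locale selected):

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // Bytes as they arrive on the wire: correct UTF-8.
        byte[] utf8Bytes = "Zürich".getBytes(StandardCharsets.UTF_8);
        // A JVM whose default charset is not UTF-8 effectively does this
        // when it builds a String without specifying the charset:
        String garbled = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(garbled); // prints ZÃ¼rich
    }
}
```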

We are sorry for the inconvenience.
Don't hesitate to contact us if you have any questions.


--
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal

_______________________________________________
Engineering mailing list
Engineering@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/engineering




