Hi,
Quick follow-up: All data has been backfilled; you can get back to normal cluster activity :)
Sorry for the inconvenience.
Joseph


On Tue, Mar 1, 2016 at 2:26 PM, Joseph Allemandou <jallemandou@wikimedia.org> wrote:
Hi,

TL;DR: Please don't use Hive / Spark / Hadoop before next week.

Last week the Analytics Team performed an upgrade to the Hadoop Cluster.
It went reasonably well, except that many of the Hadoop processes were launched with an option that did NOT use UTF-8 as the default encoding.
This issue caused trouble, particularly in page-title extraction, and was detected last Sunday (many kudos to the people who filed bugs about encoding on the Analytics API :)
We found the bug and fixed it yesterday, and backfill starts today, with the cluster recomputing every dataset from 2016-02-23 onward.
This means you shouldn't query last week's data this week: first because it is incorrect, and second because you'll curse the cluster for being too slow :)
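For illustration, here is a minimal Python sketch of the kind of mojibake a wrong default encoding produces (the title and code are hypothetical, not the actual pipeline):

```python
# Illustration (not the actual pipeline code): a process whose default
# charset is Latin-1 instead of UTF-8 garbles non-ASCII page titles.
title = "Señal"                      # hypothetical page title
utf8_bytes = title.encode("utf-8")   # bytes as stored/transferred

# Decoded with the wrong default charset, the title turns into mojibake:
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # SeÃ±al

# Decoded with UTF-8, as intended, the title round-trips cleanly:
assert utf8_bytes.decode("utf-8") == title
```

On the JVM side, the default charset is typically controlled by the `file.encoding` system property at process launch, which is presumably where the option mentioned above went wrong.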

We are sorry for the inconvenience.
Don't hesitate to contact us if you have any questions.


--
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal


