Thanks Joseph! Is it reasonable to assume that the aggregate data in projectview_hourly https://wikitech.wikimedia.org/wiki/Analytics/Data/Projectview_hourly has not been affected?
On Tue, Mar 1, 2016 at 7:24 AM, Joseph Allemandou <jallemandou@wikimedia.org> wrote:
Hey Oliver, it depends on what data you've used: if page_title or other 'encoding-sensitive' data (I can't think of any other, but ...) is part of it, then yes, you should!
On Tue, Mar 1, 2016 at 3:27 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Hey Joseph,
Thanks for letting us know. So we should delete and backfill last week's data, for our regularly scheduled scripts?
On 1 March 2016 at 08:26, Joseph Allemandou jallemandou@wikimedia.org wrote:
Hi,
TL;DR: Please don't use hive / spark / hadoop before next week.
Last week the Analytics Team performed an upgrade to the Hadoop cluster. It went reasonably well, except that many of the Hadoop processes were launched with a special option to NOT use UTF-8 as the default encoding. This issue caused trouble particularly in page title extraction and was detected last Sunday (many kudos to the people who filed bugs on the Analytics API about encoding :) ). We found the bug and fixed it yesterday, and the backfill starts today, with the cluster recomputing every dataset from 2016-02-23 onward. This means you shouldn't query last week's data during this week, first because it is incorrect, and second because you'll curse the cluster for being too slow :)
We are sorry for the inconvenience. Don't hesitate to contact us if you have any questions.
-- Joseph Allemandou Data Engineer @ Wikimedia Foundation IRC: joal
Engineering mailing list Engineering@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/engineering
-- Oliver Keyes Count Logula Wikimedia Foundation
-- *Joseph Allemandou* Data Engineer @ Wikimedia Foundation IRC: joal
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics