Hi,
*TL;DR: Please don't use Hive / Spark / Hadoop before next week.*
Last week the Analytics Team performed an upgrade of the Hadoop cluster. It went reasonably well, except that many of the Hadoop processes were launched with an option that did NOT set UTF-8 as the default encoding. This caused trouble, particularly in page title extraction, and was detected last Sunday (many kudos to the people who filed bugs about encoding on the Analytics API :). We found the bug and fixed it yesterday, and backfill starts today, with the cluster recomputing every dataset from 2016-02-23 onward. This means you shouldn't query last week's data during this week: first because it is incorrect, and second because you'll curse the cluster for being too slow :)
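To illustrate the failure mode, here is a minimal, hypothetical sketch (not our actual extraction code): any Java that turns bytes into a String without naming a charset silently picks up the JVM default, so the same bytes yield different page titles depending on how the process was launched.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class EncodingDemo {
        public static void main(String[] args) {
            // UTF-8 bytes of the page title "Köln"
            byte[] titleBytes = "Köln".getBytes(StandardCharsets.UTF_8);

            // Safe: the charset is named explicitly
            String explicit = new String(titleBytes, StandardCharsets.UTF_8);

            // Fragile: uses the JVM default charset, the pattern that breaks
            // when a process is launched without UTF-8 as its default
            String implicit = new String(titleBytes);

            System.out.println("default charset: " + Charset.defaultCharset());
            System.out.println("explicit UTF-8 : " + explicit); // always "Köln"
            System.out.println("default decode : " + implicit); // "KÃ¶ln" if the default is e.g. ISO-8859-1
        }
    }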
We are sorry for the inconvenience. Don't hesitate to contact us if you have any questions.

--
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal
On Tue, Mar 1, 2016 at 3:27 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:

Hey Joseph,

Thanks for letting us know. So should we delete and backfill last week's data for our regularly scheduled scripts?
On Tue, Mar 1, 2016 at 7:24 AM, Joseph Allemandou <jallemandou@wikimedia.org> wrote:

Hey Oliver,

It depends on what data you've used: if page_title or other 'encoding sensitive' data (I can't think of any other, but...) is part of it, then yes, you should!
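If you're unsure, a quick heuristic spot-check may help: UTF-8 text that was mis-decoded as Latin-1 typically contains 'Ã' or 'Â' characters. A hypothetical sketch, assuming your titles sit one per line in a plain-text file (expect a few false positives on titles that legitimately contain those characters):

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class MojibakeScan {
        public static void main(String[] args) throws Exception {
            // args[0]: path to a file with one page_title per line (hypothetical format)
            try (Stream<String> titles = Files.lines(Paths.get(args[0]))) {
                long suspect = titles
                        .filter(t -> t.contains("Ã") || t.contains("Â"))
                        .count();
                System.out.println("suspect titles: " + suspect);
            }
        }
    }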
On Tue, Mar 1, 2016 at 2:15 PM, Bo Han <bo.ning.han@gmail.com> wrote:

Hi,

Would you mind linking the bug fix here? I couldn't find it on Phabricator.
Thanks, Bo
On Tue, Mar 1, 2016 at 11:26 AM, Andrew Otto <otto@wikimedia.org> wrote:

https://phabricator.wikimedia.org/T128295
On Tue, Mar 1, 2016 at 9:52 PM, Bo Han <bo.ning.han@gmail.com> wrote:

Thanks!

Diffing the newly uploaded files for 20160223-160000 and 20160223-170000 against the previously uploaded ones shows that their contents are the same. Were the original pagecounts files at http://dumps.wikimedia.org/other/pagecounts-all-sites/2016/2016-02/ not corrupted? The backfill refers to other data, I assume?
Bo
On Tue, Mar 1, 2016 at 1:39 PM, Joseph Allemandou <jallemandou@wikimedia.org> wrote:

Hi!

The pagecounts files are regenerated, but they shouldn't be impacted by the encoding issue, since page_title is not decoded there :) The files I expect to have changed are the new version, pageviews: http://dumps.wikimedia.org/other/pageviews/2016/2016-02/

Joseph
Thanks Joseph. Am I correct in saying that the counts in pageviews are just the aggregated counts for decoded page titles from pagecounts-all-sites?
Bo
On Tue, Mar 1, 2016 at 2:02 PM, Joseph Allemandou <jallemandou@wikimedia.org> wrote:

Hi again,

@Dan: We will indeed reload data into Cassandra.

@Bo: Actually, the two datasets are fairly different. The one called pagecounts is slowly being deprecated in favour of the one called pageview, defined by the Research people at WMF: https://meta.wikimedia.org/wiki/Research:Page_view

The pageview dumps are actually a 'legacy format' view of the new pageview data :)

Code for the legacy extraction: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/pagecounts...
Code for the new pageview definition: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java
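For intuition only, here is a drastically simplified, hypothetical version of what such a pageview filter does; the real rules live in PageviewDefinition.java (linked above) and cover far more cases (apps, APIs, mime types, spider filtering, ...):

    public class ToyPageviewFilter {
        // Hypothetical and drastically simplified; NOT the real definition
        static boolean isPageview(String uriPath, int httpStatus, String contentType) {
            return httpStatus == 200
                    && uriPath.startsWith("/wiki/")
                    && contentType.startsWith("text/html");
        }

        public static void main(String[] args) {
            System.out.println(isPageview("/wiki/Köln", 200, "text/html; charset=UTF-8")); // true
            System.out.println(isPageview("/w/api.php", 200, "application/json"));         // false
        }
    }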
Thanks for the clarification, Joseph.
Bo
On Wed, Mar 2, 2016 at 4:22 AM, Tilman Bayer <tbayer@wikimedia.org> wrote:

Thanks Joseph! Is it reasonable to assume that the aggregate data in projectview_hourly (https://wikitech.wikimedia.org/wiki/Analytics/Data/Projectview_hourly) has not been affected?
Hi Tilman,

Your assumption is correct: you can trust projectview_hourly :)
On Wed, Mar 2, 2016 at 10:53 AM, Ori Livneh <ori@wikimedia.org> wrote:

So: what is the plan for making sure this doesn't happen the next time around? :)
On Wed, Mar 2, 2016 at 11:24 AM, Joseph Allemandou <jallemandou@wikimedia.org> wrote:

@Ori: This needs to be discussed with the team; my 2 cents:
- Detection: possible to implement as part of one of the Oozie jobs. We would compute the number of pages having a different page_title for the same uri_path (if that number is high, something is wrong); a toy sketch of this check follows below.
- Prevention: two things possible:
  - Try to understand WHY this happened (very difficult, I think; possibly related to a weird state after the upgrade) and make sure we don't fall into that state again.
  - Force JVM file.encoding for every Java process on the cluster (probably easier, but it is not easy to be sure nothing is forgotten).
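A toy, hypothetical version of the detection idea (in real life this would run over the refined request data inside a job, not over an in-memory list):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class TitleDivergence {
        // For each uri_path, count distinct page_title values; a high fraction
        // of paths with more than one title suggests an encoding problem.
        static double divergingFraction(List<String[]> rows) { // rows of {uri_path, page_title}
            Map<String, Set<String>> titlesByPath = new HashMap<>();
            for (String[] row : rows) {
                titlesByPath.computeIfAbsent(row[0], k -> new HashSet<>()).add(row[1]);
            }
            long diverging = titlesByPath.values().stream()
                    .filter(titles -> titles.size() > 1)
                    .count();
            return (double) diverging / titlesByPath.size();
        }

        public static void main(String[] args) {
            List<String[]> rows = List.of(
                    new String[]{"/wiki/K%C3%B6ln", "Köln"},
                    new String[]{"/wiki/K%C3%B6ln", "KÃ¶ln"}, // mojibake twin of the same path
                    new String[]{"/wiki/Main_Page", "Main_Page"});
            System.out.println("diverging fraction: " + divergingFraction(rows)); // 0.5
        }
    }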
I'd love to hear your thoughts and ideas and discuss them with the team. Thanks!
After meeting with the team:

- The encoding issue was due to the locale being wrongly set on some machines (but we don't know why).
- We will find a way to enforce file.encoding, first looking for a java-global way; if that is not feasible, a process-local one.
- We will NOT spend computing resources on a job trying to detect this issue (too costly for the probability of occurrence, particularly if we force file.encoding).

Cheers
Joseph
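PS: a minimal sketch of the kind of process-local guard we have in mind (hypothetical code; the java-global route would be forcing -Dfile.encoding=UTF-8 in the JVM options the cluster passes to its processes):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public final class Utf8Guard {
        private Utf8Guard() {}

        // Fail fast at job startup if the JVM was launched with a non-UTF-8
        // default, instead of silently producing corrupted page titles.
        public static void assertUtf8Default() {
            Charset def = Charset.defaultCharset();
            if (!def.equals(StandardCharsets.UTF_8)) {
                throw new IllegalStateException(
                        "Default charset is " + def + ", expected UTF-8; "
                        + "check the locale / -Dfile.encoding on this host");
            }
        }

        public static void main(String[] args) {
            assertUtf8Default();
            System.out.println("Default charset OK: UTF-8");
        }
    }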
Hi,

Quick follow-up: all data has been backfilled, so you can get back to normal cluster activity :)

Sorry for the inconvenience.

Joseph