Hi,
*TL;DR: Please don't use Hive / Spark / Hadoop before next week.*
Last week the Analytics Team performed an upgrade of the Hadoop cluster. It went reasonably well, except that many of the Hadoop processes were launched with an option that did NOT set UTF-8 as the default encoding. This caused trouble, particularly in page title extraction, and was detected last Sunday (many kudos to the people who filed bugs about encoding on the Analytics API :). We found the bug and fixed it yesterday, and backfill starts today, with the cluster recomputing every dataset from 2016-02-23 onward. This means you shouldn't query last week's data during this week: first because it is incorrect, and second because you'll curse the cluster for being too slow :)
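To illustrate the failure mode, here is a minimal, hypothetical sketch (not our actual extraction code): any Java that turns bytes into a String without naming a charset silently picks up the JVM default, so the same bytes yield different page titles depending on how the process was launched.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class EncodingDemo {
        public static void main(String[] args) {
            // UTF-8 bytes of the page title "Köln"
            byte[] titleBytes = "Köln".getBytes(StandardCharsets.UTF_8);

            // Safe: the charset is named explicitly
            String explicit = new String(titleBytes, StandardCharsets.UTF_8);

            // Fragile: uses the JVM default charset, the pattern that breaks
            // when a process is launched without UTF-8 as its default
            String implicit = new String(titleBytes);

            System.out.println("default charset: " + Charset.defaultCharset());
            System.out.println("explicit UTF-8 : " + explicit); // always "Köln"
            System.out.println("default decode : " + implicit); // "KÃ¶ln" if the default is e.g. ISO-8859-1
        }
    }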
We are sorry for the inconvenience. Don't hesitate to contact us if you have any questions.

--
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal
On Tue, Mar 1, 2016 at 3:27 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:

Hey Joseph,

Thanks for letting us know. So should we delete and backfill last week's data for our regularly scheduled scripts?
On Tue, Mar 1, 2016 at 7:24 AM, Joseph Allemandou <jallemandou@wikimedia.org> wrote:

Hey Oliver,

It depends on what data you've used: if page_title or other 'encoding sensitive' data (I can't think of any other, but...) is part of it, then yes, you should!
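If you're unsure, a quick heuristic spot-check may help: UTF-8 text that was mis-decoded as Latin-1 typically contains 'Ã' or 'Â' characters. A hypothetical sketch, assuming your titles sit one per line in a plain-text file (expect a few false positives on titles that legitimately contain those characters):

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class MojibakeScan {
        public static void main(String[] args) throws Exception {
            // args[0]: path to a file with one page_title per line (hypothetical format)
            try (Stream<String> titles = Files.lines(Paths.get(args[0]))) {
                long suspect = titles
                        .filter(t -> t.contains("Ã") || t.contains("Â"))
                        .count();
                System.out.println("suspect titles: " + suspect);
            }
        }
    }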
On Tue, Mar 1, 2016 at 2:15 PM, Bo Han <bo.ning.han@gmail.com> wrote:

Hi,

Would you mind linking the bug fix here? I couldn't find it on Phabricator.
Thanks, Bo
On Tue, Mar 1, 2016 at 11:26 AM, Andrew Otto <otto@wikimedia.org> wrote:

https://phabricator.wikimedia.org/T128295
On Tue, Mar 1, 2016 at 9:52 PM, Bo Han <bo.ning.han@gmail.com> wrote:

Thanks!

Diffing the newly uploaded files for 20160223-160000 and 20160223-170000 against the previously uploaded ones shows that their contents are the same. Were the original pagecounts files at http://dumps.wikimedia.org/other/pagecounts-all-sites/2016/2016-02/ not corrupted? The backfill refers to other data, I assume?
Bo
On Tue, Mar 1, 2016 at 1:39 PM, Joseph Allemandou <jallemandou@wikimedia.org> wrote:

Hi!

The pagecounts files are regenerated, but they shouldn't be impacted by the encoding issue, since page_title is not decoded there :) The files I expect to have changed are the new version, pageviews: http://dumps.wikimedia.org/other/pageviews/2016/2016-02/

Joseph
Thanks Joseph. Am I correct in saying that the counts in pageviews are just the aggregated counts for decoded page titles from pagecounts-all-sites?
Bo
On Tue, Mar 1, 2016 at 2:02 PM, Joseph Allemandou <jallemandou@wikimedia.org> wrote:

Hi again,

@Dan: We will indeed reload data into Cassandra.

@Bo: Actually, the two datasets are fairly different. The one called pagecounts is slowly being deprecated in favour of the one called pageview, defined by the Research people at WMF: https://meta.wikimedia.org/wiki/Research:Page_view

The pageview dumps are actually a 'legacy format' view of the new pageview data :)

Code for the legacy extraction: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/pagecounts...
Code for the new pageview definition: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java
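For intuition only, here is a drastically simplified, hypothetical version of what such a pageview filter does; the real rules live in PageviewDefinition.java (linked above) and cover far more cases (apps, APIs, mime types, spider filtering, ...):

    public class ToyPageviewFilter {
        // Hypothetical and drastically simplified; NOT the real definition
        static boolean isPageview(String uriPath, int httpStatus, String contentType) {
            return httpStatus == 200
                    && uriPath.startsWith("/wiki/")
                    && contentType.startsWith("text/html");
        }

        public static void main(String[] args) {
            System.out.println(isPageview("/wiki/Köln", 200, "text/html; charset=UTF-8")); // true
            System.out.println(isPageview("/w/api.php", 200, "application/json"));         // false
        }
    }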
Thanks for the clarification, Joseph.
Bo
On Wed, Mar 2, 2016 at 4:22 AM, Tilman Bayer <tbayer@wikimedia.org> wrote:

Thanks Joseph! Is it reasonable to assume that the aggregate data in projectview_hourly (https://wikitech.wikimedia.org/wiki/Analytics/Data/Projectview_hourly) has not been affected?
Hi Tilman,

Your assumption is correct: you can trust projectview_hourly :)
On Wed, Mar 2, 2016 at 10:53 AM, Ori Livneh <ori@wikimedia.org> wrote:

So: what is the plan for making sure this doesn't happen the next time around? :)
On Wed, Mar 2, 2016 at 11:24 AM, Joseph Allemandou <jallemandou@wikimedia.org> wrote:

@Ori: This needs to be discussed with the team; my 2 cents:
- Detection: possible to implement as part of one of the Oozie jobs. We would compute the number of pages having a different page_title for the same uri_path (if that number is high, something is wrong); a toy sketch of this check follows below.
- Prevention: two things possible:
  - Try to understand WHY this happened (very difficult, I think; possibly related to a weird state after the upgrade) and make sure we don't fall into that state again.
  - Force JVM file.encoding for every Java process on the cluster (probably easier, but it is not easy to be sure nothing is forgotten).
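A toy, hypothetical version of the detection idea (in real life this would run over the refined request data inside a job, not over an in-memory list):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class TitleDivergence {
        // For each uri_path, count distinct page_title values; a high fraction
        // of paths with more than one title suggests an encoding problem.
        static double divergingFraction(List<String[]> rows) { // rows of {uri_path, page_title}
            Map<String, Set<String>> titlesByPath = new HashMap<>();
            for (String[] row : rows) {
                titlesByPath.computeIfAbsent(row[0], k -> new HashSet<>()).add(row[1]);
            }
            long diverging = titlesByPath.values().stream()
                    .filter(titles -> titles.size() > 1)
                    .count();
            return (double) diverging / titlesByPath.size();
        }

        public static void main(String[] args) {
            List<String[]> rows = List.of(
                    new String[]{"/wiki/K%C3%B6ln", "Köln"},
                    new String[]{"/wiki/K%C3%B6ln", "KÃ¶ln"}, // mojibake twin of the same path
                    new String[]{"/wiki/Main_Page", "Main_Page"});
            System.out.println("diverging fraction: " + divergingFraction(rows)); // 0.5
        }
    }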
I'd love to hear your thoughts and ideas and discuss them with the team. Thanks!
After meeting with the team:

- The encoding issue was due to the locale being wrongly set on some machines (but we don't know why).
- We will find a way to enforce file.encoding, first looking for a java-global way; if that is not feasible, a process-local one.
- We will NOT spend computing resources on a job trying to detect this issue (too costly for the probability of occurrence, particularly if we force file.encoding).

Cheers
Joseph
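PS: a minimal sketch of the kind of process-local guard we have in mind (hypothetical code; the java-global route would be forcing -Dfile.encoding=UTF-8 in the JVM options the cluster passes to its processes):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public final class Utf8Guard {
        private Utf8Guard() {}

        // Fail fast at job startup if the JVM was launched with a non-UTF-8
        // default, instead of silently producing corrupted page titles.
        public static void assertUtf8Default() {
            Charset def = Charset.defaultCharset();
            if (!def.equals(StandardCharsets.UTF_8)) {
                throw new IllegalStateException(
                        "Default charset is " + def + ", expected UTF-8; "
                        + "check the locale / -Dfile.encoding on this host");
            }
        }

        public static void main(String[] args) {
            assertUtf8Default();
            System.out.println("Default charset OK: UTF-8");
        }
    }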
Hi,

Quick follow-up: all data has been backfilled, so you can get back to normal cluster activity :)

Sorry for the inconvenience.

Joseph