Hi everybody,
On Monday 21st we'd like to reboot all stat100x hosts for Linux kernel upgrades at around 9 AM CET. This means that all the notebooks and various activities running on those nodes will be stopped for a brief amount of time. To repay your patience, two things will be added:
- A shared Kerberos credential cache with notebooks. In practice, this means that you will only need to kinit once (either after ssh-ing to stat100x or in a Jupyter notebook), and the credentials will be shared (no more double kinit). It is already "live" on stat1004 if you want to test it! Since the new shared credential cache will have a new location on disk, all Kerberos sessions will be destroyed and you'll have to kinit again once the reboots are completed. More details in https://phabricator.wikimedia.org/T255262.
- A new endpoint for Hive called 'analytics-hive.eqiad.wmnet', which should replace Hive JDBC/metastore configs that hardcode an-coord1001.eqiad.wmnet (and allow us to fail over transparently if needed, without requesting job restarts). As a side effect, all Hive-related tools will change configs (transparently for external users). If you have any script that points directly to Hive via JDBC (for example a Python script using PyHive), please update it with the new endpoint.
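For anyone updating a script by hand, here is a minimal sketch of what the endpoint change could look like in Python. Only the two hostnames come from this announcement; the port (10000, the usual HiveServer2 default), the Kerberos service name "hive", and the helper names are assumptions for illustration, so please check them against your own script.

```python
# Hostnames taken from the announcement.
OLD_HIVE_HOST = "an-coord1001.eqiad.wmnet"
NEW_HIVE_HOST = "analytics-hive.eqiad.wmnet"

# Assumption: HiveServer2 listens on its usual default port.
HIVE_PORT = 10000


def migrate_hive_url(url: str) -> str:
    """Swap the hardcoded coordinator host for the new endpoint in a URL string."""
    return url.replace(OLD_HIVE_HOST, NEW_HIVE_HOST)


def hive_connect_kwargs(host: str = NEW_HIVE_HOST, port: int = HIVE_PORT) -> dict:
    """Build keyword arguments for pyhive.hive.connect().

    Assumption: the cluster uses Kerberos auth with service name "hive".
    """
    return {
        "host": host,
        "port": port,
        "auth": "KERBEROS",
        "kerberos_service_name": "hive",
    }


def connect_to_hive():
    """Open a PyHive connection to the new endpoint (needs a valid kinit)."""
    # Imported lazily so the helpers above work without PyHive installed.
    from pyhive import hive

    return hive.connect(**hive_connect_kwargs())
```

Keeping the hostname in one constant means a future endpoint change only touches one line of the script.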
If this schedule impacts your work, please ping me via email, IRC, etc. and I'll try to reschedule accordingly :)
Thanks!
Luca (on behalf of Analytics / Data Engineering)
Thanks, Luca!
If you use wmfdata-python https://github.com/wikimedia/wmfdata-python or wmfdata-r https://github.com/wikimedia/wmfdata-r, they should *not* be affected by the changed Hive endpoint as they pick it up from the shell environment. However, if you notice anything breaking on Monday, please contact my team at product-analytics@wikimedia.org.
On Thu, 17 Dec 2020 at 14:31, Luca Toscano ltoscano@wikimedia.org wrote:
Analytics-announce mailing list Analytics-announce@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics-announce
Update: I'd like to move the maintenance to Tuesday 22nd at around 9 AM CET. I found some weird corner cases in the current Kerberos setup on stat1004 (tracked in https://phabricator.wikimedia.org/T255262) that I need to iron out with the SRE team before applying it to all stat100x hosts.
Please let me know if this is a problem for you :)
Also updated https://wikitech.wikimedia.org/wiki/Analytics/Systems/Maintenance_Schedule
Luca (on behalf of Analytics / Data Engineering)
On Thu, Dec 17, 2020 at 10:27 AM Neil Shah-Quinn nshahquinn@wikimedia.org wrote:
--
Neil Shah-Quinn
senior data scientist, Product Analytics https://www.mediawiki.org/wiki/Product_Analytics
Wikimedia Foundation https://wikimediafoundation.org/
Hi everybody,
The maintenance is done, but sadly I wasn't able to apply the new Kerberos credential cache location (to avoid the double kinit for ssh vs. notebook) because some problems arose while testing on stat1004 (thanks a lot to Isaac for all the tests!). I am tracking the work in https://phabricator.wikimedia.org/T255262; hopefully in early January I'll be able to have something more solid :)
Happy holidays!
Luca
On Fri, Dec 18, 2020 at 12:14 PM Luca Toscano ltoscano@wikimedia.org wrote:
analytics-announce@lists.wikimedia.org