Hello everyone,
there will be a leap second on June 30th at 23:59:60 UTC (https://en.wikipedia.org/wiki/Leap_second). The leap second adjustment is normally handled as part of the NTP time synchronisation protocol (which all our hosts run): on the day of a leap second, NTP notifies the Linux kernel about the leap second and the kernel adds it after 23:59:59. (The kernel flag marking the day as containing a leap second is set well before 23:59:59, specifically at 00:00:00 of the same day.)
The last leap second was added in 2012, when bugs in the Linux kernel and Java caused problems on a number of services (especially Search and LDAP). The underlying bugs in the Linux kernel and OpenJDK have been fixed and the updates are present across all systems (we used test cases to make sure that the kernel bug from 2012 is fixed), but since leap seconds only occur every few years, we might encounter entirely new bugs when the leap second is added at midnight. There might also be problems in applications or language interpreters which bail on the existence of a 23:59:60 time.
As such, we've decided to follow a safe approach, outlined below:
* NTP will be disabled on the 29th for most systems in production. We'll keep a few non-critical systems on the default behaviour/NTP, with the intent of gathering data and experience for future leap seconds without compromising the stability of the infrastructure. The other systems will run on their hardware clocks from then on. The hardware clocks shouldn't deviate significantly over the approximately two-day period (deviations should be on the order of milliseconds), but if you're concerned that a service will cause problems without clocks being precisely in sync, please get in touch with ops@.
* Most systems in Labs will also get NTP disabled.
* Our NTP servers will keep synching with their upstream peers.
* On the 1st of July we'll re-enable NTP in batches. System clocks will move forward by a second once NTP is started again, but every application should be able to cope with an ordinary clock change just fine.
Alexandros / Moritz
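To illustrate the concern above about code that bails on a 23:59:60 time, here is a minimal Python sketch. It is not taken from any of our services; the names `LEAP`, `FMT` and `datetime_accepts_leap_second` are made up for illustration. It shows that, within a single standard library, one API tolerates the leap-second timestamp while another rejects it.

```python
import time
from datetime import datetime

LEAP = "2015-06-30 23:59:60"   # the leap second itself
FMT = "%Y-%m-%d %H:%M:%S"

# time.strptime tolerates a seconds field of 60.
parsed = time.strptime(LEAP, FMT)
leap_tm_sec = parsed.tm_sec


def datetime_accepts_leap_second():
    """Return True if a datetime object can hold second 60."""
    try:
        datetime(2015, 6, 30, 23, 59, 60)
        return True
    except ValueError:
        # datetime only allows seconds in the range 0..59.
        return False
```

Any application that builds `datetime` objects from externally supplied timestamps could hit this `ValueError` if an upstream system emits 23:59:60.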
Alex and Moritz,
thank you for taking care of this. These leap seconds are a real pain in the butt for time-based distributed systems, and I'm glad that we have a plan in place. I hope the movement to abolish leap seconds (https://en.wikipedia.org/wiki/Leap_second#Proposal_to_abolish_leap_seconds) wins out in the end!
On Thu, Jun 25, 2015 at 7:27 AM, Moritz Mühlenhoff moritz@wikimedia.org wrote:
- On the 1st of July we'll re-enable NTP in batches. System clocks will
move forward by a second once NTP is started again,
To clarify: By default, system time will move *backwards* one second.
We just talked about this on IRC, so just for others' benefit: With NTP's -x option we should be able to smear the adjustment (by slowing down the system clock temporarily) until the leap second is incorporated into the system time. This avoids non-monotonicity, which is important for systems that use time to capture causality. It would be great to apply the adjustment to all nodes of the Cassandra cluster at once, so that their clocks are being slewed in lock-step.
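For a rough sense of how long such a slew would take, here is a back-of-the-envelope sketch. It assumes a maximum slew rate of 500 ppm (0.5 ms per second), which is the usual limit for ntpd but has not been measured on our hosts; the function name is hypothetical.

```python
MAX_SLEW_PPM = 500  # assumed maximum slew rate, in parts per million


def slew_duration_seconds(offset_seconds, slew_ppm=MAX_SLEW_PPM):
    """Time needed to absorb an offset when the clock is slewed at slew_ppm."""
    return offset_seconds * 1_000_000 / slew_ppm


# Absorbing the one-second leap second offset at the assumed rate.
duration = slew_duration_seconds(1.0)
```

Under that assumption the one-second offset takes 2000 seconds, a bit over half an hour, during which the system time is still catching up.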
Gabriel
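To make the causality concern concrete, here is a toy sketch of a store that resolves conflicts by highest timestamp ("last write wins"). Everything in it is hypothetical (`ToyClock`, `last_write_wins`, the timestamps); it is a simulation, not Cassandra code. It only shows what can happen when the clock is stepped back by a second between two writes.

```python
class ToyClock:
    """Simulated wall clock that can be advanced or stepped."""

    def __init__(self, start):
        self.now = start

    def advance(self, seconds):
        # Normal passage of time.
        self.now += seconds

    def step(self, seconds):
        # Abrupt adjustment, e.g. -1.0 for a backwards step.
        self.now += seconds


def last_write_wins(writes):
    """Return the value carrying the highest timestamp."""
    return max(writes, key=lambda write: write[0])[1]


clock = ToyClock(start=1000.0)
writes = [(clock.now, "first")]       # written before the step
clock.advance(0.3)                    # 0.3 s of real time passes
clock.step(-1.0)                      # clock is stepped back one second
writes.append((clock.now, "second"))  # later in real time, earlier timestamp

winner = last_write_wins(writes)
```

Here the write that happened later in real time loses, because its timestamp is lower. Slewing instead of stepping keeps timestamps increasing, which avoids this particular failure.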
All staff members, in preparation for leap second 2015, should watch a brief training on the other problems with time & timezones[1].
[1] https://www.youtube.com/watch?v=-5wpm-gesOY
Hi Gabriel,
We just talked about this on IRC, so just for others' benefit: With NTP's -x option we should be able to smear the adjustment (by slowing down the system clock temporarily) until the leap second is incorporated into the system time. This avoids non-monotonicity, which is important for systems that use time to capture causality. It would be great to apply the adjustment to all nodes of the Cassandra cluster at once, so that their clocks are being slewed in lock-step.
Unfortunately this was broken in NTP 4.2.6, which was only recently discovered: http://bugs.ntp.org/show_bug.cgi?id=2745 (only fixed in current development/pre-releases). Even if we backported the fixes to our time servers we'd run into problems, since the local time deviation in the "normalisation period" wouldn't be consistent across the nodes of the Cassandra cluster.
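As an illustration of one way such an inconsistency could arise, here is a small sketch. It assumes each node slews independently at 500 ppm and that two nodes start slewing at different times; the rate, the start times and the function name are all assumptions chosen for the example, not a description of our setup.

```python
SLEW_PPM = 500  # assumed slew rate, in parts per million


def remaining_offset(initial_offset, seconds_since_start, slew_ppm=SLEW_PPM):
    """Offset still uncorrected after slewing for seconds_since_start."""
    if seconds_since_start <= 0:
        return initial_offset
    corrected = seconds_since_start * slew_ppm / 1_000_000
    return max(initial_offset - corrected, 0.0)


# Hypothetical scenario: node B starts slewing 600 s after node A.
t = 900                                    # seconds after node A started
offset_a = remaining_offset(1.0, t)        # node A has slewed for 900 s
offset_b = remaining_offset(1.0, t - 600)  # node B has slewed for 300 s

# Difference between the two nodes' clocks at that moment.
skew = offset_b - offset_a
```

In this made-up scenario the two nodes disagree by about 0.3 seconds part-way through the adjustment, even though both end up correct eventually.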
(chrony has fully supported NTP smearing since the 2.0 release on 27th April 2015, but that's also not a solution for the upcoming leap second.)
We'll follow up in a separate mail on how best to accommodate the Cassandra cluster.
Cheers, Moritz