We've had a number of NTP and clock-based anomalies over approximately the past 12 to 18 hours. Upon further investigation, as unlikely as this might seem, we think the hypervisor may be presenting an incorrect time to the guest.
Whilst I'll discuss a specific host in this ticket, we've seen it on multiple hosts over this time period. All examples have been spot instances in eu-west-1, although the instance type varies.
On this host, /sys/devices/system/clocksource/clocksource0/current_clocksource returns xen.
With no NTP daemon running, the following can be observed:
# ntpdate 0.amazon.pool.ntp.org && sleep 900 && ntpdate 0.amazon.pool.ntp.org
13 Dec 13:42:04 ntpdate[4889]: step time server 4.53.160.75 offset 4.531351 sec
13 Dec 13:57:29 ntpdate[5639]: step time server 52.48.113.20 offset 16.268760 sec
That is to say, ntpdate corrected 4.53 seconds of skew, we waited 15 minutes, and ntpdate then had to correct a further 16 seconds of lag, all with xen as the clock source.
We also see the same thing with tsc as the clock source. However, given that these are all PV instances, we believe the underlying time source is broadly the same:
root@ip-172-31-20-47:/var/lib/ntp# echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
root@ip-172-31-20-47:/var/lib/ntp# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
root@ip-172-31-20-47:/var/lib/ntp# ntpdate 0.amazon.pool.ntp.org && sleep 900 && ntpdate 0.amazon.pool.ntp.org
13 Dec 14:00:25 ntpdate[5786]: step time server 52.48.113.20 offset 3.049115 sec
13 Dec 14:15:48 ntpdate[8165]: step time server 31.28.161.68 offset 16.155150 sec
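To put the two second-pass offsets in perspective, the rough sketch below converts each one into an implied drift rate. The helper name offsets_with_drift is hypothetical. It assumes the interval between corrections was exactly the 900 seconds of the sleep, ignoring the few seconds ntpdate itself takes to run, so the percentages are approximate.

```shell
# Hypothetical helper: read ntpdate output on stdin and, for every line that
# reports an offset, print the offset and the drift it implies as a percentage
# of an ASSUMED 900 second interval (the sleep used in the tests above).
offsets_with_drift() {
    awk -v secs=900 '{
        for (i = 1; i < NF; i++)
            if ($i == "offset")
                printf "%s s %.2f%%\n", $(i + 1), ($(i + 1) / secs) * 100
    }'
}

# Second-pass lines from the xen and tsc runs, copied from the transcripts.
offsets_with_drift <<'EOF'
13 Dec 13:57:29 ntpdate[5639]: step time server 52.48.113.20 offset 16.268760 sec
13 Dec 14:15:48 ntpdate[8165]: step time server 31.28.161.68 offset 16.155150 sec
EOF
```

Both runs come out at roughly 1.8% (around 18,000 ppm), which is well beyond the 500 ppm that ntpd is normally able to correct by slewing. The near-identical rate under both clock sources is consistent with the drift originating below the guest.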