Sunday, December 14, 2014

AWS - EC2 - MongoDB replica set time sync issues - NTP - Replication lag

We are encountering clock drift issues with our MongoDB replica set running on AWS. This seemed to start recently, after we added additional data to the set. Before then we did not really notice the issue unless the system was under heavy load. Now the following error is logged sporadically in the mongod.log file even when the system is not under load.

To test this we isolated a set of machines with the same dataset that are not in use by our web application, and the error still occurs:

2014-12-12T13:33:51.333+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Dec 12 13:32:42:c which is more than 30 seconds behind member mongo1:27017 whose most recent OpTime is 1418391230

The timestamps in the log line above show that one of the MongoDB replica set members is over a minute behind. The worst we have seen is 12 minutes out of sync.
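The gap in that log line can be checked by converting the epoch value to a date and subtracting. This is a minimal sketch that assumes both OpTimes are in UTC (the log timestamp carries a +0000 offset) and ignores the `:c` suffix on the first OpTime:

```python
from datetime import datetime, timezone

# Values copied from the log line above.
# Assumption: the "Dec 12 13:32:42" OpTime is in UTC and in the year 2014.
target_optime = datetime(2014, 12, 12, 13, 32, 42, tzinfo=timezone.utc)
mongo1_epoch = 1418391230  # seconds since the Unix epoch

# Convert the epoch value to a timezone-aware UTC datetime.
mongo1_optime = datetime.fromtimestamp(mongo1_epoch, tz=timezone.utc)

# Difference between the two OpTimes, in whole seconds.
gap_seconds = int((mongo1_optime - target_optime).total_seconds())

print(mongo1_optime.isoformat())  # 2014-12-12T13:33:50+00:00
print(gap_seconds)                # 68
```

So in this particular entry the sync target is 68 seconds behind mongo1:27017.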

This error in turn causes replication lag, and we receive a notification about it from the Mongo Monitoring Service, although the lag does correct itself.
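To watch the lag directly rather than waiting for a notification, the `optimeDate` of each member in the `replSetGetStatus` output can be compared against the primary. The following is a rough sketch, not a monitoring tool: the sample document is made up for illustration (only `mongo1:27017` appears in our logs; `mongo2` and `mongo3` are hypothetical names), and it assumes exactly one member reports `PRIMARY`.

```python
from datetime import datetime

def replication_lag_seconds(status):
    """Return {member name: seconds behind the primary} for each secondary.

    `status` is a dict shaped like the output of replSetGetStatus.
    Raises StopIteration if no member is currently PRIMARY.
    """
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }

# Hypothetical sample, trimmed to the fields used above.
sample_status = {
    "members": [
        {"name": "mongo1:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2014, 12, 12, 13, 33, 50)},
        {"name": "mongo2:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2014, 12, 12, 13, 32, 42)},
        {"name": "mongo3:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2014, 12, 12, 13, 33, 50)},
    ]
}

lag = replication_lag_seconds(sample_status)
print(lag)
```

Against a live set, the status document could be fetched with pymongo, for example `MongoClient("mongodb://mongo1:27017").admin.command("replSetGetStatus")`, and passed to the same function.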

The setup is 3 x r3.xlarge AWS Linux instances, one in each availability zone of the eu-west-1 region. The machines have been set up using the MongoDB recommended settings, with a RAID array and the CloudFormation scripts provided by MongoDB. The data is around 4GB in size.

We think the issue is related to NTP sync. By default, on the AWS Linux Amazon Machine Image (AMI), the ntpd service is configured to use a pool of AWS NTP servers hosted on www.pool.ntp.org.

To try to rule this out, we set up our own NTP server on AWS for the MongoDB servers to sync to. The issue still occurred, so we changed the minpoll and maxpoll values for the ntpd service on the mongo machines so that they poll the NTP server every 16 seconds, but the error is still occurring.
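For reference, ntpd takes minpoll and maxpoll as power-of-two exponents rather than seconds, so a 16-second interval corresponds to the value 4. The sketch below builds the matching ntp.conf `server` line; the hostname `ntp.internal.example` is a placeholder, not our real server, and the helper names are made up for illustration.

```python
def poll_exponent(interval_seconds):
    """Return the log2 exponent for a power-of-two poll interval in seconds."""
    if interval_seconds <= 0 or interval_seconds & (interval_seconds - 1):
        raise ValueError("poll interval must be a positive power of two")
    # For a power of two, bit_length() - 1 is exactly log2.
    return interval_seconds.bit_length() - 1

def server_line(host, interval_seconds):
    """Build an ntp.conf server line that pins polling to one interval."""
    exponent = poll_exponent(interval_seconds)
    return f"server {host} minpoll {exponent} maxpoll {exponent}"

# Placeholder hostname; substitute the internal NTP server's address.
line = server_line("ntp.internal.example", 16)
print(line)  # server ntp.internal.example minpoll 4 maxpoll 4
```

Setting minpoll and maxpoll to the same value fixes the polling interval instead of letting ntpd back off to longer intervals over time.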

We also increased the MongoDB oplog size to see if that would make any difference, but it didn't.

Has anyone else encountered this type of issue? Is there something we are missing?

