2014년 12월 6일 토요일

Cluster set up with MMS becoming non-responsive after a few days

We currently have two Mongo replica sets in use; the first is our primary datastore, which is running MongoDB v2.4.x and is managed through chef. The second is our logging datastore, which is running v2.6.5 and is set up and managed through MMS.

We have zero stability issues with the primary datastore. The replica set that was set up by MMS, however, is constantly having problems. Every few days the primary will become non-responsive and go down, effectively bringing down the entire replica set since a new primary is not elected (3 node replica set, and MMS doesn't create an arbiter). When the server is down, it is considered 'reachable' by Amazon, but it was dysfunctional. According to nmap, 3 ports (including ftp and ssh) were open, but we could not actually connect.

All servers are on AWS, US West 2 (Oregon) region
(3) R3 Large (memory optimized) instance types
Instances were provisioned by MMS
Ubuntu 14.04 LTS
The only changes to the server config we have made was to install Ganglia
~40% disk utilization
One collection in the database



Could you try and fetch the mongod logs from your replica set and see what they say?
Here is how you do that:
  1. Go to your Deployment page and enter "edit mode"
  2. Click the arrow in the edit column for a member of your replica set
  3. Click the gear in the sidebar that appears and click "Request logs"
  4. Now you'll be in the mongo logs tab. Wait about a minute and click the "Show Log" link
If that doesn't work, check to see if the servers are active in the Servers tab. If you can SSH to them, check the mongod logs on the hosts directly.



The only entries in the logs that I can find are these:
2014-11-25T21:56:37.377+0000 [clientcursormon] mem (MB) res:13137 virt:329002
2014-11-25T21:56:37.377+0000 [clientcursormon]  mapped (incl journal view):327746
2014-11-25T21:56:37.377+0000 [clientcursormon]  connections:215
2014-11-25T21:56:37.377+0000 [clientcursormon]  replication threads:32

(the above was randomly peppered throughout the logs, not necessarily associated with an outage)

:57.499+0000 [conn331355] end connection 10.34.129.202:46101 (238 connections now open)
2014-11-26T04:24:57.500+0000 [initandlisten] connection accepted from 10.34.129.202:46107 #331357 (239 connections now open)
2014-11-26T04:25:09.433+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member mdbs-logs-4.nuorder-logs-production.6821.mongodbdns.com:27000 whose most recent OpTime is 1416975909
2014-11-26T04:25:09.433+0000 [rsBackgroundSync] replSet syncing to: mdbs-logs-3.nuorder-logs-production.6821.mongodbdns.com:27000
2014-11-26T04:25:09.436+0000 [rsBackgroundSync] replset setting syncSourceFeedback to mdbs-logs-3.nuorder-logs-production.6821.mongodbdns.com:27000
2014-11-26T04:25:09.442+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member mdbs-logs-4.nuorder-logs-production.6821.mongodbdns.com:27000 whose most recent OpTime is 1416975909
2014-11-26T04:25:09.444+0000 [rsBackgroundSync] replset setting syncSourceFeedback to mdbs-logs-3.nuorder-logs-production.6821.mongodbdns.com:27000
2014-11-26T04:25:09.450+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member mdbs-logs-4.nuorder-logs-production.6821.mongodbdns.com:27000 whose most recent OpTime is 1416975909
2014-11-26T04:25:09.452+0000 [rsBackgroundSync] replset setting syncSourceFeedback to mdbs-logs-3.nuorder-logs-production.6821.mongodbdns.com:27000
2014-11-26T04:25:09.458+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member mdbs-logs-4.nuorder-logs-production.6821.mongodbdns.com:27000 whose most recent OpTime is 1416975909
2014-11-26T04:25:09.460+0000 [rsBackgroundSync] replset setting syncSourceFeedback to mdbs-logs-3.nuorder-logs-production.6821.mongodbdns.com:27000
2014-11-26T04:25:09.466+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member mdbs-logs-4.nuorder-logs-production.6821.mongodbdns.com:27000 whose most recent OpTime is 1416975909
2014-11-26T04:25:09.468+0000 [rsBackgroundSync] replset setting syncSourceFeedback to mdbs-logs-3.nuorder-logs-production.6821.mongodbdns.com:27000
2014-11-26T04:25:09.474+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member mdbs-logs-4.nuorder-logs-production.6821.mongodbdns.com:27000 whose most recent OpTime is 1416975909
2014-11-26T04:25:09.476+0000 [rsBackgroundSync] replset setting syncSourceFeedback to mdbs-logs-3.nuorder-logs-production.6821.mongodbdns.com:27000
2014-11-26T04:25:09.482+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member mdbs-logs-4.nuorder-logs-production.6821.mongodbdns.com:27000 whose most recent OpTime is 1416975909
2014-11-26T04:25:09.484+0000 [rsBackgroundSync] replset setting syncSourceFeedback to mdbs-logs-3.nuorder-logs-production.6821.mongodbdns.com:27000
2014-11-26T04:25:09.490+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member mdbs-logs-4.nuorder-logs-production.6821.mongodbdns.com:27000 whose most recent OpTime is 1416975909
2014-11-26T04:25:09.492+0000 [rsBackgroundSync] replset setting syncSourceFeedback to mdbs-logs-3.nuorder-logs-production.6821.mongodbdns.com:27000
2014-11-26T04:25:09.498+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member mdbs-logs-4.nuorder-logs-production.6821.mongodbdns.com:27000 whose most recent OpTime is 1416975909
2014-11-26T04:25:09.500+0000 [rsBackgroundSync] replset setting syncSourceFeedback to mdbs-logs-3.nuorder-logs-production.6821.mongodbdns.com:27000
2014-11-26T04:25:09.506+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member mdbs-logs-4.nuorder-logs-production.6821.mongodbdns.com:27000 whose most recent OpTime is 1416975909
2014-11-26T04:25:09.508+0000 [rsBackgroundSync] replset setting syncSourceFeedback to mdbs-logs-3.nuorder-logs-production.6821.mongodbdns.com:27000

Some additional information:
As this is a log server, it receives a ton of inserts, zero updates, and relatively few reads. There were approximately 58M documents in the collection at the time of the last outage.



The only entries in the logs that I can find are these:
2014-11-25T21:56:37.377+0000 [clientcursormon] mem (MB) res:13137 virt:329002
2014-11-25T21:56:37.377+0000 [clientcursormon]  mapped (incl journal view):327746
2014-11-25T21:56:37.377+0000 [clientcursormon]  connections:215
2014-11-25T21:56:37.377+0000 [clientcursormon]  replication threads:32

(the above was randomly peppered throughout the logs, not necessarily associated with an outage)

:57.499+0000 [conn331355] end connection 10.34.129.202:46101 (238 connections now open)
2014-11-26T04:24:57.500+0000 [initandlisten] connection accepted from 10.34.129.202:46107 #331357 (239 connections now open)
2014-11-26T04:25:09.433+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member {{hostname-4 redacted}} whose most recent OpTime is 1416975909
2014-11-26T04:25:09.433+0000 [rsBackgroundSync] replSet syncing to: {{hostname-3 redacted}}
2014-11-26T04:25:09.436+0000 [rsBackgroundSync] replset setting syncSourceFeedback to {{hostname-3 redacted}}
2014-11-26T04:25:09.442+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member {{hostname-4 redacted}} whose most recent OpTime is 1416975909
2014-11-26T04:25:09.444+0000 [rsBackgroundSync] replset setting syncSourceFeedback to {{hostname-3 redacted}}
2014-11-26T04:25:09.450+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member {{hostname-4 redacted}} whose most recent OpTime is 1416975909
2014-11-26T04:25:09.452+0000 [rsBackgroundSync] replset setting syncSourceFeedback to {{hostname-3 redacted}}
2014-11-26T04:25:09.458+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member {{hostname-4 redacted}} whose most recent OpTime is 1416975909
2014-11-26T04:25:09.460+0000 [rsBackgroundSync] replset setting syncSourceFeedback to {{hostname-3 redacted}}
2014-11-26T04:25:09.466+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member {{hostname-4 redacted}} whose most recent OpTime is 1416975909
2014-11-26T04:25:09.468+0000 [rsBackgroundSync] replset setting syncSourceFeedback to {{hostname-3 redacted}}
2014-11-26T04:25:09.474+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member {{hostname-4 redacted}} whose most recent OpTime is 1416975909
2014-11-26T04:25:09.476+0000 [rsBackgroundSync] replset setting syncSourceFeedback to {{hostname-3 redacted}}
2014-11-26T04:25:09.482+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member {{hostname-4 redacted}} whose most recent OpTime is 1416975909
2014-11-26T04:25:09.484+0000 [rsBackgroundSync] replset setting syncSourceFeedback to {{hostname-3 redacted}}
2014-11-26T04:25:09.490+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member {{hostname-4 redacted}} whose most recent OpTime is 1416975909
2014-11-26T04:25:09.492+0000 [rsBackgroundSync] replset setting syncSourceFeedback to {{hostname-3 redacted}}
2014-11-26T04:25:09.498+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member {{hostname-4 redacted}} whose most recent OpTime is 1416975909
2014-11-26T04:25:09.500+0000 [rsBackgroundSync] replset setting syncSourceFeedback to {{hostname-3 redacted}}
2014-11-26T04:25:09.506+0000 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Nov 26 04:24:11:1 which is more than 30 seconds behind member {{hostname-4 redacted}} whose most recent OpTime is 1416975909
2014-11-26T04:25:09.508+0000 [rsBackgroundSync] replset setting syncSourceFeedback to {{hostname-3 redacted}}

Some additional information:
As this is a log server, it receives a ton of inserts, zero updates, and relatively few reads. There were approximately 58M documents in the collection at the time of the last outage.



I'm looking at your "... Logs Production"* group and everything looks OK now. Sorry about the delay from the Thanksgiving holiday.

Is there anything else I can help with?



No problem about the delay, I was away for the Holiday as well.

We had another outage on the system this morning around 5:50 am Pacific Time.



The only thing I can see in the MMS logs for your replica set around that time is the reboot of one of your secondaries (#5). The logs MMS would pull are too new to shed any light on what might have caused the reboot of this server. I'd suggest looking into the mongod and system logs on this server around this time to see what led to the reboot.


댓글 없음:

댓글 쓰기