In production I have a 10-node cluster: 5 PRIMARY, 5 SECONDARY, 3 CONFIG servers, and 2 MONGOS.
However, to simplify the use case, I've started a test cluster with 2 PRIMARY, 1 SECONDARY, 1 CONFIG, and 1 MONGOS by issuing:
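Roughly the following (the same ShardingTest helper call I give in full further down; the exact options may differ):
$ mongo --nodb
> cluster = new ShardingTest({shards : 2, rs : {nodes : [{}, {arbiter: true}]} });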
Then I attach another node from a virtual server:
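In short (the full session is further down in this thread):
$ mongo --port 31200
test-rs1:PRIMARY> rs.add({_id: 2, host: "192.168.33.10:27017", priority: 2});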
which takes over as PRIMARY; this is the node I'll be shutting down.
I then issue 2 read queries per second against the sharded collection.
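The real load comes from a Ruby test script (linked further down); a rough mongo shell equivalent against the collection I create later would be something like:
mongos> use mongodb_outage_test
mongos> db.getMongo().setReadPref("secondaryPreferred");
mongos> while (true) { printjson(db.data.findOne({_id: Math.floor(Math.random() * 100000)})); sleep(500); }  // ~2 reads per second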
Then I simulate a PRIMARY server failure by issuing "halt -n -f".
Depending on the driver I use, I see downtime from about 20s (original mongo gem) to 60s (mongoid4 on mongo 2.4).
Upgrading to mongo 2.6 didn't improve the downtime at all.
This is almost the same as on my production setup.
So my question is: can I do better?
Looking forward to hearing from you,
Could you clarify your setup a bit? What is the purpose of the config servers and mongos and how do they relate to the problem? Config servers and mongos are needed for sharded clusters, not for a replica set. For your simplified setup, you must be running 2 shards, each a replica set, but then which shard has a secondary and which doesn't? If you add a new primary to the 1 node replica set and then take it down, that replica set won't have a majority. You need to clarify exactly what you're doing. Can you reproduce this behavior with just a 3 node replica set and no sharding components?
Could you clarify your setup a bit?
My production setup consists of 5 shards, each with a PRIMARY, a SECONDARY, and an ARBITER. In front of them I have two MONGOS on separate nodes.
What is the purpose of the config servers and mongos and how do they relate to the problem?
The downtime can only be observed in a sharded environment. Since my test script connects to the database via MONGOS, it's a vital part of the test environment.
Config servers and mongos are needed for sharded clusters, not for a replica set. For your simplified setup, you must be running 2 shards, each a replica set, but then which shard has a secondary and which doesn't? If you add a new primary to the 1 node replica set and then take it down, that replica set won't have a majority. You need to clarify exactly what you're doing.
$ mongo --nodb
MongoDB shell version: 2.6.5
> cluster = new ShardingTest({shards : 2, rs : {nodes : [{}, {arbiter: true}]} });
This forms 2 shards, each a replica set with only a PRIMARY data member.
I can't get this command to work when adding a secondary node, as in "cluster = new ShardingTest({shards : 2, rs : {nodes : [{}, {}, {arbiter: true}]} });" - the replica set won't start then, but that doesn't matter for this test, so we can skip it.
Then I connect to the initial PRIMARY of the second replica set:
$ mongo --port 31200
test-rs1:PRIMARY> rs.add({_id: 2, host: "192.168.33.10:27017", priority: 2});
{ "ok" : 1 }
After the STARTUP phase, the newly added node becomes PRIMARY.
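A quick way to confirm the takeover - any member should now report the new node as primary, e.g.:
test-rs1:SECONDARY> db.isMaster().primary
192.168.33.10:27017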
Then I set up sharding as follows:
$ mongo --port 30999
MongoDB shell version: 2.6.5
connecting to: 127.0.0.1:30999/test
mongos> sh.enableSharding("mongodb_outage_test");
{ "ok" : 1 }
mongos> use mongodb_outage_test;
switched to db mongodb_outage_test
mongos> sh.shardCollection("mongodb_outage_test.data", { "_id": 1 } );
{ "collectionsharded" : "mongodb_outage_test.data", "ok" : 1 }
And finally I add some data (it takes a while):
mongos> for (i = 0; i < 100000; i++) {
... document = {_id: i, dane: Array(128).join(i) };
... db.data.insert( document);
... }
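As a sanity check, the chunk distribution across the two shards can be inspected with (output omitted here):
mongos> db.data.getShardDistribution()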
Then I start my test script (https://gist.github.com/lowang/5fc24c6e40b03a613d2b), which connects to the cluster through MONGOS and uses read: :secondary_preferred.
Then I execute "halt -n -f" to kill the PRIMARY of the second replica set.
Can you reproduce this behavior with just a 3 node replica set and no sharding components?
I can't. I can't use mongos in that case, so I switch to Mongo::MongoReplicaSetClient, passing all members directly. It works flawlessly; there is no downtime at all when the PRIMARY goes down.
I've got it running on a simpler setup (roughly as in the sketch below):
one replica set: one PRIMARY, one SECONDARY, one ARBITER;
one shard: one CONFIG, one MONGOS, sharding "enabled" - since only one shard is running, all nodes hold all the data.
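One way to build such a setup by hand (a sketch only - the rs0 name, ports, and dbpaths are illustrative and not necessarily what I used):
$ mongod --replSet rs0 --port 27017 --dbpath /data/rs0-0 &
$ mongod --replSet rs0 --port 27018 --dbpath /data/rs0-1 &
$ mongod --replSet rs0 --port 27019 --dbpath /data/rs0-arb &
$ mongod --configsvr --port 27020 --dbpath /data/cfg0 &
$ mongos --configdb localhost:27020 --port 27021 &
$ mongo --port 27017
> rs.initiate({_id: "rs0", members: [
...   {_id: 0, host: "localhost:27017"},
...   {_id: 1, host: "localhost:27018"},
...   {_id: 2, host: "localhost:27019", arbiterOnly: true}]});
$ mongo --port 27021
mongos> sh.addShard("rs0/localhost:27017");
mongos> sh.enableSharding("mongodb_outage_test");
mongos> sh.shardCollection("mongodb_outage_test.data", { "_id": 1 });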
Then I ran the previous test script to hammer it with 2 reads per second; as expected, I saw a 21s outage while the PRIMARY was down, and none of the read queries could complete during that time.
So I guess MONGOS is the one responsible here, but its logs seem valid:
[ReplicaSetMonitorWatcher] Socket recv() timeout 192.168.33.10:27017
[ReplicaSetMonitorWatcher] SocketException: remote: 192.168.33.10:27017 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.33.10:27017]
[ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
[ReplicaSetMonitorWatcher] Detected bad connection created at 1415193150556494 microSec, clearing pool for 192.168.33.10:27017 of 0 connections
[ReplicaSetMonitorWatcher] warning: No primary detected for set test-rs0
Are there any parameters I can tune to shorten that downtime to at most 1 second?