I've been experimenting with loading a moderately large dataset: ~1 billion documents averaging ~3 KB per document. The cluster setup has been 9 shards, 3 config servers, and as many query servers (mongos) as necessary - we've been testing with 10 - with no replication, spread over 10 physical boxes. Each box has 64 GB of RAM to work with.
We started with a baseline benchmark, without sharding, as the time to beat: we manually split up the dataset and inserted it in parallel into non-sharded mongod instances. This finished in about 24 hours.
Up until last week we had been experimenting with tuning 2.6, and we had hit a wall: using a variety of pre-splitting and shard key strategies, we hadn't been able to get past roughly 200M documents loaded in 10 hours - nowhere near our benchmark time.
Now, using 2.8 with WiredTiger, we were able to hit ~500M documents loaded in about 9 hours. The configuration was default WiredTiger parameters with a pre-split sharded collection over the 9 shards. However, somewhere in the next 9 hours there was a significant slowdown, resulting in only 700M documents loaded at the 18-hour mark.
At this point I noticed behavior similar to our 2.6 loading slowdown when looking at db.currentOp(): while inserts were being distributed evenly over the shards at the beginning, they eventually settled into one shard at a time, rotating every few minutes or so. Additionally, db.currentOp() lists somewhere on the order of 40 operations at a time, with some operations reporting "secs_running" in the 30-60 second range.
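For reference, a filter along these lines surfaces the long-running operations - a minimal sketch using standard db.currentOp() output fields:

// List in-progress operations that have been running for 30+ seconds.
// secs_running and active are standard fields in db.currentOp() output.
db.currentOp({ active: true, secs_running: { $gte: 30 } }).inprog.forEach(function (op) {
    print(op.opid + "  " + op.op + "  " + op.ns + "  " + op.secs_running + "s");
});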
Has anyone else run into similar bottlenecks and had any success in tuning?
Let's check a few things before getting into the nitty-gritty:
What version of 2.8.0? There've been several release candidates so far and various bugs fixed along the way.
What is your shard key? You mention pre-splitting - can you give some details about this? What you describe with inserts going into a single shard (eventually) suggests some problem with shard key distribution (or possibly a pre-splitting issue).
When things "slow down" - is it possible that you are getting to the point where everything you need to touch on insert no longer fits in RAM (i.e. cache_size configured by WiredTiger - by default 1/2 of physical memory on the machine)?
Are there migrations running during the load? When you pre-split, best practice is to disable the balancer for the load stage - if that isn't done, it's possible, even likely, that the slowdown is caused by migrations, which will be operating on a different part of the data than your inserts are dealing with.
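For reference, disabling the balancer for the duration of the load (and re-enabling it afterwards) can be done from a mongos shell; sh.getBalancerState() confirms the setting:

// Disable chunk migrations for the duration of the bulk load (run via mongos).
sh.stopBalancer();          // or: sh.setBalancerState(false)
sh.getBalancerState();      // should now report false

// ... run the load ...

// Re-enable the balancer once the load has finished.
sh.startBalancer();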
Also worth noting we were initially testing on a raid5 array, which is obviously not ideal. Since then I've tested on 10 nodes with SSDs available. Obviously much better results, although write concentration still seems to happen towards the end of the load.
I'd also like to test using a set of shards with different performance characteristics, expecting to see the slowest shards being a bottleneck on load performance.
What version of 2.8.0? There've been several release candidates so far and various bugs fixed along the way.
2.8.0-rc3
What is your shard key? You mention pre-splitting - can you give some details about this? What you describe with inserts going into a single shard (eventually) suggests some problem with shard key distribution (or possibly a pre-splitting issue).
I've tested using both hashed _id and hashed uuid (a field in our dataset, a java-generated uuid).
When things "slow down" - is it possible that you are getting to the point where everything you need to touch on insert no longer fits in RAM (i.e. cache_size configured by WiredTiger - by default 1/2 of physical memory on the machine)?
The noticeable slowdown happens well after filling the available RAM, interestingly enough. There's an initial slowdown after mongod saturates the RAM, then a second, much greater slowdown when inserts become concentrated on one node.
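A quick way to watch this on each shard is to compare the configured WiredTiger cache size with what is currently in the cache via serverStatus() - a minimal sketch, assuming the standard WiredTiger cache stat names (they may differ slightly between the 2.8 release candidates):

// Compare the configured WiredTiger cache size with current usage on a shard.
var cache = db.serverStatus().wiredTiger.cache;
var GB = 1024 * 1024 * 1024;
print("cache configured: " + (cache["maximum bytes configured"] / GB).toFixed(1) + " GB");
print("cache in use:     " + (cache["bytes currently in the cache"] / GB).toFixed(1) + " GB");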
Are there migrations running during the load? When you pre-split, best practice is to disable the balancer for the load stage - if that isn't done, it's possible, even likely, that the slowdown is caused by migrations, which will be operating on a different part of the data than your inserts are dealing with.
All the tests have been performed with the balancer off.
Responses inline:
On Mon, Dec 29, 2014 at 8:35 AM, Kevin Katcher <kevkat@gmail.com> wrote:
> Also worth noting we were initially testing on a raid5 array, which is
> obviously not ideal. Since then I've tested on 10 nodes with SSDs available.
> Obviously much better results, although write concentration still seems to
> happen towards the end of the load.

Right, overall throughput will be better as your disk speed/IOPS increase, but if the bottleneck is something else, then you will hit that eventually just the same.
>> What is your shard key? You mention pre-splitting - can you give some
>> details about this? What you describe with inserts going into a single
>> shard (eventually) suggests some problem with shard key distribution (or
>> possibly a pre-splitting issue).
> I've tested using both hashed _id and hashed uuid (a field in our dataset, a
> java-generated uuid).

When you initially shard the collection with a hashed shard key, you should get two chunks on each shard as a result of that command. I would recommend running sh.status() after the sharding of the empty collection is complete (I'm assuming the collection is empty when you start out; if not, you have work to do yourself, splitting and distributing the chunks evenly across the shards).
>> When things "slow down" - is it possible that you are getting to the point
>> where everything you need to touch on insert no longer fits in RAM (i.e.
>> cache_size configured by WiredTiger - by default 1/2 of physical memory on
>> the machine)?
> The noticeable slowdown happens well after filling the available RAM,
> interestingly enough. There's an initial slowdown after mongod saturates the
> RAM, then a second, much greater slowdown when inserts become concentrated on
> one node.

Okay, so the first of the slowdowns is normal - you are transitioning from having a working dataset that fits in RAM to one that no longer does, and things will therefore go slower.

The second slowdown is *not* normal. If your chunk ranges are distributed evenly, then the inserts should continue going equally into all of the shards.
> All the tests have been performed with the balancer off.

Great. All that remains is to determine that, when your insert test starts, your chunk ranges are equally distributed across the shards. Can you check that please?
sh.status() will give you the basic distribution.
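If the sh.status() output gets unwieldy with thousands of chunks, a per-shard chunk count straight from the config database is more compact - a sketch assuming the usual config.chunks schema (ns and shard fields):

// Count chunks per shard for the collection being loaded (run via mongos).
db.getSiblingDB("config").chunks.aggregate([
    { $match: { ns: "ourdb.collection_name" } },
    { $group: { _id: "$shard", chunks: { $sum: 1 } } },
    { $sort: { chunks: -1 } }
])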
sh.status() output is quite long since I've started another run and the collection is already pre-split, but the chunks are definitely distributed evenly. I ran the initial split using:
db.adminCommand( { shardCollection: "ourdb.collection_name", key: { uuid : "hashed" }, numInitialChunks: 27000 } );
This results in 3,000 chunks per shard across the 9 shards, with the chunk ranges evenly distributed from the start.
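For a compact sanity check of where documents are actually landing once the load is running, getShardDistribution() reports per-shard document and data totals - a minimal sketch:

// Per-shard document and data totals; a lopsided shard stands out immediately.
db.getSiblingDB("ourdb").collection_name.getShardDistribution()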
Also worth mentioning: all of these tests so far have used the Java driver with unordered bulk inserts.
And how big are your batches? How many threads in the Java client are doing the inserting (and into how many mongos processes)?
I've been testing with 10 Java processes running 3 threads each, pointing at the 10 mongos query servers. Since I don't split up the bulk operations in the Java code, I'm assuming the driver's maximum of 1,000 operations per batch is reached and mongos splits these up further.
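For illustration, here is the shell analogue of what the Java loaders do with unordered bulk inserts, but with an explicit batch size instead of leaning on the driver's internal 1,000-operation limit - the batch size, loop count, and document shape are just placeholders:

// Shell analogue of the Java loaders: unordered bulk inserts flushed in
// explicit batches rather than relying on the driver's internal splitting.
var coll = db.getSiblingDB("ourdb").collection_name;
var BATCH_SIZE = 500;                            // illustrative value only
var bulk = coll.initializeUnorderedBulkOp();
var pending = 0;

for (var i = 0; i < 100000; i++) {               // stand-in for reading real documents
    bulk.insert({ uuid: new ObjectId().str, payload: "..." });   // stand-in for the real java-generated uuid and fields
    if (++pending === BATCH_SIZE) {
        bulk.execute();                          // send this batch through mongos
        bulk = coll.initializeUnorderedBulkOp(); // a bulk object cannot be reused after execute()
        pending = 0;
    }
}
if (pending > 0) bulk.execute();                 // flush the final partial batch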
Another new performance clue, though: after discussing the bulk inserting we decided to run the Java loaders without bulk operations, and indeed the slowdown is not observed - we managed to load 1.33 billion documents in ~18 hours.