We have a problem with aggregation queries taking a long time to run (a couple of minutes).
Collection:
We have a collection of 250 million documents with about 20 fields per document.
The total size of the collection is 110GB.
We have indexes on the "our_id" and "dtKey" fields.
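For reference, a minimal sketch of how those indexes would be declared in the mongo shell (the exact definitions are an assumption on our part; only the indexed fields are stated above):
// Assumed single-field indexes on the two fields named above:
db.our_collection.createIndex({"our_id": 1});
db.our_collection.createIndex({"dtKey": 1});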
Hardware:
Memory:
24GB RAM (6 * 4GB DIMMs, 1333 MHz)
Disk:
LVM volume of 11TB built from 4 disks of 3TB each:
• 600MB/s maximum instantaneous data transfers.
• 7200 RPM spindle. Average latency = 4.16ms
• RAID 0
CPU:
2 * E5-2420 0 @ 1.90GHz
Total of 12 cores with 24 threads.
Dell R420.
Problem:
We are trying to make an aggregation query of the following:
db.our_collection.aggregate(
[
{
"$match":
{
"$and":
[
{"dtKey":{"$gte":20140916}},
{"dtKey":{"$lt": 20141217}},
{"our_id":"111111111"}
]
}
},
{
"$project":
{
"field1":1,
"date":1
}
},
{
"$group":
{
"_id":
{
"day":{"$dayOfYear":"$date"},
"year":{"$year":"$date"}
},
"field1":{"$sum":"$field1"}
}
}
]
);
This query takes a couple of minutes to run; while it is running we can see the following:
1. The Mongo current operation is yielding more than 300K times (see the currentOp sketch after this list)
2. On iostat we see ~100% disk utilization
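A sketch of how the yield count can be watched from the shell while the aggregation is running (db.currentOp() is the standard shell helper; the filter on secs_running is only illustrative):
// Print opid, number of yields, and running time for long-running operations:
db.currentOp().inprog
    .filter(function (op) { return op.secs_running > 10; })
    .forEach(function (op) {
        printjson({opid: op.opid, numYields: op.numYields, secs_running: op.secs_running});
    });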
After this query is done it seems to be in the cache, and it can be run again in a split second.
However, after running it for 3-4 different users, it seems that the first one has already been evicted from the cache and the query takes a long time again.
We have tested a count on the matching part and seen that we have users with 50K documents as well as users with 500K documents.
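A sketch of the count we ran on the matching part (the our_id value is illustrative):
// Count documents matched for one user over the date range:
db.our_collection.count({
    "our_id": "111111111",
    "dtKey": {"$gte": 20140916, "$lt": 20141217}
});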
We tried to get only the matching part:
db.pub_stats.aggregate(
[
{
"$match":
{
"$and":
[
{"dtKey":{"$gte":20140916}},
{"dtKey":{"$lt":20141217}},
{" our_id ":"112162107"}
]
}
}
]
);
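To see which index the match-only pipeline uses, here is a sketch with the explain option of aggregate (a standard shell option; we have not included its output here):
db.pub_stats.aggregate(
    [
        {"$match": {"dtKey": {"$gte": 20140916, "$lt": 20141217}, "our_id": "112162107"}}
    ],
    {explain: true}
);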
This query seems to take approximately 300-500MB of memory,
but after running the full pipeline, memory usage seems to grow to 3.5GB.
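A sketch of how the memory figures above can be read from the shell (db.serverStatus().mem is a standard call reporting resident/virtual sizes in MB; whether this is exactly how the figures above were taken is our assumption):
// Resident and virtual memory of the mongod process, in MB:
printjson(db.serverStatus().mem);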
Questions:
1. Why does the aggregation pipeline take so much memory?
2. How can we improve its performance so that it runs in a reasonable time for an HTTP request?