Monday, December 29, 2014

MongoDB Aggregation Performance

We have a problem with aggregation queries running for a long time (a couple of minutes).
Collection:
We have a collection of 250 million documents with about 20 fields per document.
The total size of the collection is 110 GB.
We have indexes on the "our_id" and "dtKey" fields.
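For reference, the index setup described above would look roughly like this (a sketch; we assume two separate single-field indexes with default options, not a compound index):

    // Separate single-field indexes on the two fields used by the query (a sketch; actual options unknown)
    db.our_collection.createIndex({ "our_id": 1 });
    db.our_collection.createIndex({ "dtKey": 1 });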
Hardware:
Memory:
24 GB RAM (6 * 4 GB DIMMs, 1333 MHz)
Disk:
11 TB LVM volume built from 4 * 3 TB disks in RAID 0:
        600 MB/s maximum instantaneous data transfer.
        7200 RPM spindles, average latency = 4.16 ms.
CPU:
2 * Xeon E5-2420 @ 1.90 GHz, 12 cores / 24 threads in total.
Server: Dell R420.

Problem:
We are trying to run the following aggregation query:
db.our_collection.aggregate(
    [
        {
            "$match":
            {
                "$and":
                    [
                        {"dtKey":{"$gte":20140916}},
                        {"dtKey":{"$lt":20141217}},
                        {"our_id":"111111111"}
                    ]
            }
        },
        {
            "$project":
            {
                "field1":1,
                "date":1
            }
        },
        {
            "$group":
            {
                "_id":
                {
                    "day":{"$dayOfYear":"$date"},
                    "year":{"$year":"$date"}
                },
                "field1":{"$sum":"$field1"}
            }
        }
    ]
);
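To check which index the $match stage uses and how many documents it scans, the same pipeline can be run with the explain option (a sketch, assuming MongoDB 2.6 or later, where aggregate() accepts an options document):

    // Run the same pipeline with explain enabled to inspect index use and scanned counts
    // (assumes MongoDB 2.6+, where aggregate() accepts an options document)
    db.our_collection.aggregate(
        [
            {"$match": {"$and": [
                {"dtKey": {"$gte": 20140916}},
                {"dtKey": {"$lt": 20141217}},
                {"our_id": "111111111"}
            ]}},
            {"$project": {"field1": 1, "date": 1}},
            {"$group": {
                "_id": {"day": {"$dayOfYear": "$date"}, "year": {"$year": "$date"}},
                "field1": {"$sum": "$field1"}
            }}
        ],
        {"explain": true}
    );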
This query takes a couple of minutes to run. While it is running we can see the following:
1.      The current operation in Mongo yields more than 300K times.
2.      iostat shows ~100% disk utilization.
After the query is done, the data seems to be in the cache and the same query can be run again in a split second.
After running it for 3-4 different users, the data for the first one seems to have already been evicted from the cache, and the query takes a long time again.
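The yield count mentioned in point 1 can be watched while the aggregation is running with something along these lines (a sketch; the "our_db.our_collection" namespace is an assumption):

    // List in-progress operations on our collection and print how often each has yielded
    // ("our_db.our_collection" is an assumed namespace - replace with the real database name)
    db.currentOp().inprog
        .filter(function (op) { return op.ns === "our_db.our_collection"; })
        .forEach(function (op) {
            print("opid " + op.opid + ": " + op.numYields + " yields");
        });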

We have tested a count on the matching part and have seen that some users have around 50K matching documents while others have around 500K.
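For reference, those per-user counts were checked with a plain count over the same match condition, roughly like this (a sketch):

    // Count the documents one user has in the date range (same condition as the $match stage)
    db.our_collection.count(
        {
            "dtKey": {"$gte": 20140916, "$lt": 20141217},
            "our_id": "111111111"
        }
    );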
We also tried running only the matching part:
db.our_collection.aggregate(
    [
        {
            "$match":
            {
                "$and":
                    [
                        {"dtKey":{"$gte":20140916}},
                        {"dtKey":{"$lt":20141217}},
                        {" our_id ":"112162107"}
                    ]
            }
        }
    ]
);

This match-only query seems to take approximately 300-500 MB of memory,
but after running the full pipeline, memory usage seems to grow to about 3.5 GB.
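One way to watch mongod's memory from the shell while reproducing this is serverStatus (a sketch; the mapped counter assumes the MMAPv1 storage engine of that era):

    // Print the mongod process memory counters reported by serverStatus (values are in MB)
    var mem = db.serverStatus().mem;
    print("resident: " + mem.resident + " MB, virtual: " + mem.virtual + " MB, mapped: " + mem.mapped + " MB");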

Questions:
1.       Why does the aggregation pipeline take so much memory?
2.       How can we improve performance so that the query runs in a reasonable time for an HTTP request?

