2014년 11월 29일 토요일

Fast query on 2.6.4 runs slow on 2.6.5 due to wrong index selection

Since upgrading to 2.6.5, we have a query that was previously fast (on 2.6.4) (less than a second) start running slow on us (4 minutes).  Furthermore, running explain on the query shows one index being selected while running the actual query itself hits another index.

In this collection, documents are uniquely identified by and queried for via the following complex key fields:
    siid : int32 - 17919 distinct values
    svid : int32 - 237 distinct values
    date : int64 - 806 distinct values
    reg  : int32 - 11 distinct values
    dev  : int32 - 9 distinct values
    
We have the following indexes:
    { svid: 1, date: -1, reg: 1, dev: 1 }
    { siid: 1, date: -1, dev: 1}
    
Total # of documents in collection:
    163 million

The following query (issued via C# client) went from nearly instantaneous to painfully slow (4 minutes) after upgrading to 2.6.5:
{ "dev" : { "$in" : [0] }, "siid" : 78, "svid" : { "$in" : [49, 23, 93, 172, 102, 4997225, 16342, 2451, 645, 200, 97, 226, 25, 16343, 378420, 75401, 4486732, 509, 40, 127, 119, 29281, 42, 916, 18, 134143, 72081, 503, 126535, 118, 605, 5120598, 4467928, 5021292, 891, 384, 4610579, 897, 53317, 2178777, 324, 327, 5120597, 5120629, 124092, 938202, 321, 138986, 2243, 1484, 240532, 20270, 1219, 5103414, 44, 147347, 5120575, 1491, 5120609, 117, 5120593, 122, 190, 5120619, 61, 440, 5120603, 1902, 3767020, 5120623, 5120599, 5120594, 122721, 5120589, 38289, 214, 5120600, 2038374, 316043, 479, 1418818, 5103094, 5120591, 5120586, 5120590, 5120605, 5120580, 5854, 72872, 232, 29781, 5120612, 2156, 202, 635481, 5575, 3343185, 4009780, 5120578, 5021223, 3125718, 5120618, 331, 1604, 3841, 5120607] }, "$or" : [{ "date" : NumberLong(1410307200) }, { "date" : 1412812800 }] }

Running explain on 2.6.5 yielded this query plan (the same that explain on 2.6.4 shows):
{ "cursor" : "BtreeCursor siid_1_date_-1_dev_1", "isMultiKey" : false, "n" : 133, "nscannedObjects" : 50381, "nscanned" : 50628, "nscannedObjectsAllPlans" : 52526, "nscannedAllPlans" : 54923, "scanAndOrder" : false, "indexOnly" : false, "nYields" : 429, "nChunkSkips" : 0, "millis" : 119, "indexBounds" : { "siid" : [[78, 78]], "date" : [[{ "$maxElement" : 1 }, { "$minElement" : 1 }]], "dev" : [[0, 0]] }, "server" : "ip-10-140-147-188:27017", "filterSet" : false, "stats" : { "type" : "KEEP_MUTATIONS", "works" : 50630, "yields" : 429, "unyields" : 429, "invalidates" : 0, "advanced" : 133, "needTime" : 50496, "needFetch" : 0, "isEOF" : 1, "children" : [{ "type" : "FETCH", "works" : 50630, "yields" : 429, "unyields" : 429, "invalidates" : 0, "advanced" : 133, "needTime" : 50496, "needFetch" : 0, "isEOF" : 1, "alreadyHasObj" : 0, "forcedFetches" : 0, "matchTested" : 133, "children" : [{ "type" : "IXSCAN", "works" : 50629, "yields" : 429, "unyields" : 429, "invalidates" : 0, "advanced" : 50381, "needTime" : 248, "needFetch" : 0, "isEOF" : 1, "keyPattern" : "{ siid: 1, date: -1, dev: 1 }", "isMultiKey" : 0, "boundsVerbose" : "field #0['siid']: [78, 78], field #1['date']: [MaxKey, MinKey], field #2['dev']: [0, 0]", "yieldMovedCursor" : 0, "dupsTested" : 0, "dupsDropped" : 0, "seenInvalidated" : 0, "matchTested" : 0, "keysExamined" : 50628, "children" : [] }] }] } } 

The index selected, siid_1_date_-1_dev_1, is what I'd expect.  However, in practice, running this query on 2.6.5 was taking minutes to finish.  I checked db.currentOp() and found this in the planSummary for this query: "IXSCAN { svid: 1, date: -1, reg: 1, dev: 1 }"   So the server was picking the wrong index (note that our query does NOT include a 'reg' parameter).

We downgraded the server back to 2.6.4 and then this query went back to its same speedy self.

These results were gathered in isolation with no other queries running.  This collection is not sharded and resides entirely on the single server I hit this on.

Any ideas why the query was being executed against the wrong index?



Could you try to re-upgrade to 2.6.5, reproduce the slow query issue, and then post the output of running db.collection.getPlanCache().getPlansByQuery(<query predicate>)?  This should reveal why the query planner chose to cache a plan using the { svid: 1, date: -1, reg: 1, dev: 1 } index for the query.  Note that explain bypasses the plan cache, which is likely why you see an inconsistency between the plans for the non-explain queries vs. the explain queries.



Ah, I didn't realize that explain bypassed the plan cache.

I've upgraded the server back to 2.6.5 but can't get the plan cache back into this state (via running some of our other queries and running this query with various parameters, clearing the plan cache periodically).

I'm going to leave this server on 2.6.5 and will dig in more if the problem resurfaces.



After a month of running without incident (on 2.6.5), we back into this state again.  Interestingly enough, last night we dropped the database and reloaded it with single character field names (we're pushing a billion documents across all databases).

To recap, we have these indexes (new field names):

Index 1: { b: 1, d: -1, r: 1, v: 1 }
Index 2: { s: 1, d: -1, v: 1}
  
We're querying on fields 'b', 's', 'd', 'v' and the query planner, at the moment, is choosing index 1 which results in a slow query instead of index 2, which runs fast.  Our code is such that we can include field 'r' with our query which will solve this for us but if this is useful to you guys, I can leave us in this state for a few hours to help debug.

Below is the output of getPlansByQuery.

prodtwo:PRIMARY> db.siteWidgets.getPlanCache().getPlansByQuery({ "v" : { "$in" : [0] }, "s" : 78, "b" : { "$in" : [49, 23, 93, 172, 102, 4997225, 16342, 2451, 645, 200, 97, 226, 25, 16343, 378420, 75401, 4486732, 509, 40, 127, 119, 29281, 42, 916, 18, 134143, 72081, 503, 126535, 118, 605, 5120598, 4467928, 5021292, 891, 384, 4610579, 897, 53317, 2178777, 324, 327, 5120597, 5120629, 124092, 938202, 321, 138986, 2243, 1484, 240532, 20270, 1219, 5103414, 44, 147347, 5120575, 1491, 5120609, 117, 5120593, 122, 190, 5120619, 61, 440, 5120603, 1902, 3767020, 5120623, 5120599, 5120594, 122721, 5120589, 38289, 214, 5120600, 2038374, 316043, 479, 1418818, 5103094, 5120591, 5120586, 5120590, 5120605, 5120580, 5854, 72872, 232, 29781, 5120612, 2156, 202, 635481, 5575, 3343185, 4009780, 5120578, 5021223, 3125718, 5120618, 331, 1604, 3841, 5120607] }, "$or" : [{ "d" : NumberLong(1410307200) }, { "d" : 1412812800 }] })
[
        {
                "details" : {
                        "solution" : "(index-tagged expression tree: tree=Node\n---Node\n------Leaf \n------Leaf \n---Leaf \n---Leaf { b: 1, d: -1, r: 1, v: 1 }, pos: 0\n---Leaf { b: 1, d: -1, r: 1, v: 1 }, pos: 3\n)"
                },
                "reason" : {
                        "score" : 1.0003000000000002,
                        "stats" : {
                                "type" : "KEEP_MUTATIONS",
                                "works" : 1,
                                "yields" : 1,
                                "unyields" : 1,
                                "invalidates" : 0,
                                "advanced" : 0,
                                "needTime" : 0,
                                "needFetch" : 0,
                                "isEOF" : 1,
                                "children" : [
                                        {
                                                "type" : "FETCH",
                                                "works" : 1,
                                                "yields" : 1,
                                                "unyields" : 1,
                                                "invalidates" : 0,
                                                "advanced" : 0,
                                                "needTime" : 0,
                                                "needFetch" : 0,
                                                "isEOF" : 1,
                                                "alreadyHasObj" : 0,
                                                "forcedFetches" : 0,
                                                "matchTested" : 0,
                                                "children" : [
                                                        {
                                                                "type" : "IXSCAN",
                                                                "works" : 1,
                                                                "yields" : 1,
                                                                "unyields" : 1,
                                                                "invalidates" : 0,
                                                                "advanced" : 0,
                                                                "needTime" : 0,
                                                                "needFetch" : 0,
                                                                "isEOF" : 1,
                                                                "keyPattern" : "{ b: 1, d: -1, r: 1, v: 1 }",
                                                                "isMultiKey" : 0,
                                                                "boundsVerbose" : "field #0['b']: [6, 6], [16, 16], field #1['d']: [MaxKey,MinKey], field #2['r']: [MinKey, MaxKey], field #3['v']: [0, 0]",
                                                                "yieldMovedCursor" : 0,
                                                                "dupsTested" : 0,
                                                                "dupsDropped" : 0,
                                                                "seenInvalidated" : 0,
                                                                "matchTested" : 0,
                                                                "keysExamined" : 0,
                                                                "children" : [ ]
                                                        }
                                                ]
                                        }
                                ]
                        }
                },
                "feedback" : {
                        "nfeedback" : 20,
                        "averageScore" : 1.0003008590619313,
                        "stdDevScore" : 4.464790168780838e-7,
                        "scores" : [
                                {
                                        "score" : 1.0003
                                },
                                {
                                        "score" : 1.0003006763555313
                                },
                                {
                                        "score" : 1.0003007587467225
                                },
                                {
                                        "score" : 1.0003013421611244
                                },
                                {
                                        "score" : 1.000301616371839
                                },
                                {
                                        "score" : 1.0003008342407909
                                },
                                {
                                        "score" : 1.0003013553060092
                                },
                                {
                                        "score" : 1.0003014281906433
                                },
                                {
                                        "score" : 1.000300311575404
                                },
                                {
                                        "score" : 1.000300541460692
                                },
                                {
                                        "score" : 1.000300541460692
                                },
                                {
                                        "score" : 1.000301079352298
                                },
                                {
                                        "score" : 1.0003009803867613
                                },
                                {
                                        "score" : 1.0003010793523965
                                },
                                {
                                        "score" : 1.0003009803867613
                                },
                                {
                                        "score" : 1.0003014812128155
                                },
                                {
                                        "score" : 1.000300311575404
                                },
                                {
                                        "score" : 1.000300311575404
                                },
                                {
                                        "score" : 1.0003007757636662
                                },
                                {
                                        "score" : 1.0003007757636662
                                }
                        ]
                },
                "filterSet" : false
        },
        {
                "details" : {
                        "solution" : "(index-tagged expression tree: tree=Node\n---Node\n------Leaf \n------Leaf \n---Leaf { s: 1, d: -1, v: 1 }, pos: 0\n---Leaf \n---Leaf { s: 1, d: -1, v: 1 }, pos: 2\n)"
                },
                "reason" : {
                        "score" : 1.0003,
                        "stats" : {
                                "type" : "KEEP_MUTATIONS",
                                "works" : 1,
                                "yields" : 1,
                                "unyields" : 1,
                                "invalidates" : 0,
                                "advanced" : 0,
                                "needTime" : 1,
                                "needFetch" : 0,
                                "isEOF" : 0,
                                "children" : [
                                        {
                                                "type" : "FETCH",
                                                "works" : 1,
                                                "yields" : 1,
                                                "unyields" : 1,
                                                "invalidates" : 0,
                                                "advanced" : 0,
                                                "needTime" : 1,
                                                "needFetch" : 0,
                                                "isEOF" : 0,
                                                "alreadyHasObj" : 0,
                                                "forcedFetches" : 0,
                                                "matchTested" : 0,
                                                "children" : [
                                                        {
                                                                "type" : "IXSCAN",
                                                                "works" : 1,
                                                                "yields" : 1,
                                                                "unyields" : 1,
                                                                "invalidates" : 0,
                                                                "advanced" : 1,
                                                                "needTime" : 0,
                                                                "needFetch" : 0,
                                                                "isEOF" : 0,
                                                                "keyPattern" : "{ s: 1, d: -1, v: 1 }",
                                                                "isMultiKey" : 0,
                                                                "boundsVerbose" : "field #0['s']: [78, 78], field #1['d']: [MaxKey, MinKey], field #2['v']: [0, 0]",
                                                                "yieldMovedCursor" : 0,
                                                                "dupsTested" : 0,
                                                                "dupsDropped" : 0,
                                                                "seenInvalidated" : 0,
                                                                "matchTested" : 0,
                                                                "keysExamined" : 1,
                                                                "children" : [ ]
                                                        }
                                                ]
                                        }
                                ]
                        }
                },
                "feedback" : {

                },
                "filterSet" : false
        }
]



Thanks for providing the plan cache output.

I can infer that the following query generated this cache entry: { v: { $in: [0] }, s: 78, b: { $in: [6, 16] }, $or: [ {d: ... }, {d: ... } ] }, and the plan ranker tried out indexes { b: 1, d: -1, r: 1, v: 1 } and { s: 1, d: -1, v: 1 }.  The scan over the { b: 1, ... } index finished immediately and returned zero results, since no document in the collection met the condition { b: { $in: [6, 16] } }.  In contrast, the scan over the { s: 1, ... } index found at least one entry that met the condition { s: 78 }.  Hence the { b: 1, ... } plan actually performed "better" (it scanned the fewest number of index entries during execution), and it was considered the winner and cached for this query shape.

The following ticket is tracking this specific issue (the change in behavior around caching plans for queries that return zero results): <https://jira.mongodb.org/browse/SERVER-15802>.  In 2.6, the query subsystem rewrite changed some of the logic around query plan caching, and we're seeing some users run into issues.  Feel free to add yourself as a watcher to that ticket to track its progress. 

As a workaround, I would suggesting adding a hint() to this query to force it to use the correct index.



Cool, thanks a bunch for looking into this.

We've worked around this and are watching the issue.



YUP. getting hit by this bug too. Queries went from subsecond to well over 10 seconds.

Going to dig in but... at this point sounds like we may have to downgrade to 2.6.4.



Frequently this can be caused by presence of superfluous/unneeded indexes.
Before downgrading check if the index that *is* being picked is a necessary one, if all the query parts are working and/or if you use hint...



This burned us yet again this morning.  It looks like the only way for me to get around this is to specify an index hint for the query.  We're running this query via the aggregation framework though and it looks like hints aren't supported. https://jira.mongodb.org/browse/SERVER-7944  Is that correct?

If that's the case, I don't see a way out of this as dropping an index is not an option for us here.  We'll likely be rolling back to 2.6.4 as well.



While you cannot hit directly, you can sometimes indirectly make it
use the correct index by adding a sort stage along with your match -
they will get combined and the index that best satisfies the
combination will get picked.

For example, in your original example, I'm quite certain that if you
change your aggregation to add {$sort:{ siid: 1, date: -1, dev: 1}}
immediately after the $match stage, you will force the selection of
the correct index with no loss of performance, since the results
already come out sorted when you are getting them from an index.


댓글 없음:

댓글 쓰기