2014년 12월 9일 화요일

Strange behaviour with compound multikey indexes and ranges.

I'm experiencing a very strange behavior when dealing with multikey indexes.

There seems to be no way to get a query to use a compound multikey index efficient by querying by a specified range.

Setting up the collection:

db.foo.insert({'MultiKey': [1,2], 'Range': 100000});
db.foo.insert({'MultiKey': 6, 'Range': 100});
db.foo.insert({'MultiKey': [5,2], 'Range': 56000});
db.foo.insert({'MultiKey': [1,2,8], 'Range': 10000});

db.foo.ensureIndex({'MultiKey': 1,'Range': 1});


When querying a compound multikey index and limiting the result by a specified range it will ignore the "$gt" entry and use "-Infinity" instead.

db.foo.find({'MultiKey': 1, 'Range': {$gt: 10, $lt: 100}}).explain(true);

"indexBounds" : {
"MultiKey" : [[1, 1]],
"Range" : [[-Infinity,100]]
}


If I remove the "$lt" entry it will use the "$gt" value and then use "Infinity" as the "$lt" value.

db.foo.find({'MultiKey': 1, 'Range': {$gt: 10}}).explain(true);

"indexBounds" : {
"MultiKey" : [[1, 1]],
"Range" : [[10, Infinity]]
}


I'm using MongoDB 2.6, the same issue exists in 2.4.6 but there the behavior is reversed and the '$gt' value is always prioritized.
This seems like a very strange behavior and I haven't found any documentations as to why this is.



There are Jira tickets explaining why this works this way - basically it comes down to the fact that the optimizer has no way to know which of the fields in the multiKey index is the array.   If it could be Range then these bounds would be correct not to miss matching documents.



Ok, didn't find those tickets.

Is there anyway to get the optimizer to keep the "$gt" instead of the "$lt" for the index scan?



Try the arguments in the other order ($lt and $gt).
I'm on my phone but will look up the tickets when I'm at my computer - or maybe someone else will post a link.



It doesn't seem to matter which order I put them in.

This is quite annoying as it's preventing me from upgrading to the latest MongoDB. 
Even though it's no ideal with only one index boundary my app worked fine on previous versions since the issue for some reason seems to have been reversed there.

I've read through the different explanations for this behavior and I do understand why this is an issue on a multikey index, but I don't really see how this apply to the non-multikey part of the index.



> I've read through the different explanations for this behavior and I do
> understand why this is an issue on a multikey index, but I don't really see
> how this apply to the non-multikey part of the index.
I'm glad you found the discussions of the limitation here.  The reason this applies to "non-multikey part" of the index is that MongoDB does not know which parts of the index are array and which aren't.

In fact, it's perfectly legal to have this:

db.collection.ensureIndex({a:1,b:1})
db.collection.insert({a:[1,2,3], b:5})
db.collection.insert({a:9, b:[7,8,9]})

The a:1,b:1 index is multikey index and here *both* a and b can be an array - so index that is flagged multikey just says "one or some or all of the parts of this index have an array for value in at least one document".

Not only is it the case that the optimizer does not know that in your case Range is *not* an array, theoretically, "Range" could be an array along with "MultiKey" field being in array (just in different documents).

This is a penalty for allowing indexing multiple fields where more than one of them could be an array.  

Even if the optimizer maintained stats on which fields in the index have which type values (which it might do in the future, but does not do currently) it would then have to select a different query plan depending on whether one or more than one field in the index actually contains arrays.

Here is why it *must* use only one of the bounds when it does not know that Range is *not* an array.

Let's add one document to your sample data:

> db.foo.insert({MultiKey:1, Range:[5, 500]})
WriteResult({ "nInserted" : 1 })
> db.foo.find({'MultiKey': 1, 'Range': {$gt: 10, $lt: 100}}).explain();
{
       "cursor" : "BtreeCursor MultiKey_1_Range_1",
       "isMultiKey" : true,
       "n" : 1,
       ...
       "indexBounds" : {
             "MultiKey" : [
               [
                    1,
                    1
               ]
             ],
            "Range" : [
               [
                    -Infinity,
                    100
               ]
             ]
         }
}
> db.foo.find({'MultiKey': 1, 'Range': {$gt: 10, $lt: 100}}){ "_id" : ObjectId("5485dbcd85e259d8b3d296fe"), "MultiKey" : 1, "Range" : [ 5, 500 ] }

If we used the bounds on both sides, we would *miss* this document and the way that the query language specifies this, Range: [5,500] matches both $lt:100 and $gt:10.
The way to specify that you want a single element to match both is to use $elemMatch, and now the query optimizer knows it can use both boundaries:

> db.foo.find({'MultiKey': 1, 'Range': {$elemMatch:{$gt: 10, $lt: 100}}}).explain()
{
     "cursor" : "BtreeCursor MultiKey_1_Range_1",
     "isMultiKey" : true,
     ...
     "indexBounds" : {
     "MultiKey" : [
        [
            1,
            1
        ]
      ],
      "Range" : [
        [
            10,
            100
        ]
      ]
     }
}

Unfortunately $elemMatch will not match a *non* array element, so the solution here might be to vote up SERVER-6050 to be able to specify $elemMatch in your query and have narrowest possible bounds be used.



Thank you for taking the time to explain the issue, I really appreciate it.

I see now why both bounds can't be used, I would however like to see a way to control which one the optimizer chooses as it doesn't seem possible as of now.



I managed to find a workaround, it's far from pretty but it get the job done. =)

By using $or you can write the query without the $lt bound and thereby force the $gt bound to be used. 

db.foo.find({
$or : [
{'MultiKey': 1,'Range': {$gt: 10}},
{'_id':0} // Dummy query to force the or clause to be used.
],
'Range': {$gt: 10, $lt: 100}
}).explain(true);


댓글 없음:

댓글 쓰기