Sunday, December 14, 2014

Question on read performance when searching with a regex (to simulate a LIKE/contains-text operator)

This has probably been addressed already, but let me pose the question anyway. Thanks in advance!

I have a collection of about 5 million Customer documents (with name, contact-person, etc. fields). I have indexed the name and contact-person fields. Now, when I try to find all the customers whose name contains the text "Name10000", I see that the entire 5M documents are scanned:

> db.customers.find({name:{$regex:"Name10000"}},{name:1,_id:1}).explain()
{
        "cursor" : "BtreeCursor name_1",
        "isMultiKey" : false,
        "n" : 111,
        "nscannedObjects" : 111,
        "nscanned" : 5000000,
        "nscannedObjectsAllPlans" : 111,
        "nscannedAllPlans" : 5000000,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 39062,
        "nChunkSkips" : 0,
        "millis" : 5148,
        "indexBounds" : {
                "name" : [
                        [
                                "",
                                {

                                }
                        ],
                        [
                                /Name10000/,
                                /Name10000/
                        ]
                ]
        },
        "server" : "myhost:27017",
        "filterSet" : false
}
>  

Now, the question is: is this expected? Or is anything wrong with the way I build the query/indexes? Basically, this read performance is not in the expected range, and I am looking for ways to improve it (I have everything on one machine for now, but can sharding help get better query performance? Anything else I should try?).

Thanks so much and have a nice day!



It's expected. For almost all types of regular expressions, MongoDB must scan every value in the name index to find which documents match. MongoDB can use an index more efficiently for left-anchored regular expressions like /^Name10000/ by turning the regex condition into a range condition. Index behavior with regex conditions is explained in the $regex docs.
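The difference can be sketched with a plain sorted list, which is roughly how a B-tree index orders its keys (a minimal Python sketch of the idea, not MongoDB internals):

```python
import bisect

# A B-tree index keeps keys in sorted order, like this list.
names = sorted(f"Name{i}" for i in range(1, 20001))

# Left-anchored /^Name1000/: the regex becomes a range condition,
# so we can jump straight to ["Name1000", "Name1001") and stop.
lo = bisect.bisect_left(names, "Name1000")
hi = bisect.bisect_left(names, "Name1001")
prefix_matches = names[lo:hi]   # touches only the matching keys

# Unanchored /Name1000/: no usable range, every key must be checked.
contains_matches = [n for n in names if "Name1000" in n]
```

Both forms find the same documents here; the unanchored one just had to examine all 20,000 keys to do it, which is the nscanned : 5000000 you are seeing.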

To improve the performance of the query, you should try to eliminate the regex condition. It seems likely you can match the customer name exactly instead of using a regex, or, if you must be able to match Name10000 anywhere in a longer name string, consider using a text index, which will perform much better than a full index scan.



Thanks Will. I have dropped the existing index and created a text index (maybe I am wrong here, but it seems I can use only one text index in a query, which is actually a limitation for my use case).

There are a lot of customers whose name contains the string "name1000", but the following query returns only one document (the one whose name starts with "name1000"). Is this expected? If so, I wonder how I can make the query search for all customers whose name contains the text "name1000" (I tried $search: "*name1000*" with no luck).
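If I understand text indexes correctly, this happens because they store whole, case-folded terms split on word boundaries, so $text matches complete terms only and wildcards like *name1000* are not supported. A rough Python sketch of that idea (not MongoDB's actual tokenizer):

```python
import re

def tokenize(text):
    """Very rough stand-in for a text index: whole, lower-cased
    terms split on word boundaries (no stemming)."""
    return set(re.findall(r"\w+", text.lower()))

docs = ["name1000", "name10001", "Acme name1000 Ltd"]

# A term search matches complete tokens only: "name10001" is a
# single, different token, so it does not match "name1000".
matches = [d for d in docs if "name1000" in tokenize(d)]
```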


> db.ROEntity.find( { $text: { $search: "name1000" } } ).explain()
{
        "cursor" : "TextCursor",
        "n" : 1,
        "nscannedObjects" : 1,
        "nscanned" : 1,
        "nscannedObjectsAllPlans" : 1,
        "nscannedAllPlans" : 1,
        "scanAndOrder" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 0,
        "server" : "myhost:27017",
        "filterSet" : false
}
>

Thanks so much!



What was wrong with the previous indexed query?

It was not examining 5M documents - it was only looking at the 111 documents that matched your condition.  All of the filtering was done from the index. 

        "cursor" : "BtreeCursor name_1",
        "n" : 111,
        "nscannedObjects" : 111,
        "nscanned" : 5000000,



As you can see, the issue is the response time:

 "millis" : 5148,

So, I am looking to optimize this query. The observation is that when a regex for contains-text is used, this is the output of explain():

  > db.ROEntity.find({name:{$regex:"Name1000"}},{name:1,_id:1}).explain()
{
        "cursor" : "BtreeCursor name_1",
        "isMultiKey" : false,
        "n" : 111,
        "nscannedObjects" : 111,
        "nscanned" : 5000000,
        "nscannedObjectsAllPlans" : 111,
        "nscannedAllPlans" : 5000000,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 39062,
        "nChunkSkips" : 0,
        "millis" : 5118,
        ....
        ....


But when a regex for starts-with is used, this is the output of explain():

> db.ROEntity.find({name:{$regex:"^Name1000"}},{name:1,_id:1}).explain()
{
        "cursor" : "BtreeCursor name_1",
        "isMultiKey" : false,
        "n" : 111,
        "nscannedObjects" : 111,
        "nscanned" : 112,
        "nscannedObjectsAllPlans" : 111,
        "nscannedAllPlans" : 112,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 1,

I am still new to Mongo, but the response time, as you can see above, is very good. However, that is a starts-with search, not a contains-text search.

Basically, we want to be able to search these 5M documents given a partial name. In Oracle, for instance, a query like the one below would do the trick:

select * from customers where name like '%Name1000%'  
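For what it's worth, that LIKE predicate is the same substring test as the $regex query; translating LIKE wildcards into a regex (a small illustrative sketch) makes the equivalence explicit:

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into the equivalent anchored
    regex (% -> .*, _ -> .), i.e. the predicate $regex evaluates."""
    parts = re.split(r"([%_])", pattern)
    body = "".join(
        ".*" if p == "%" else "." if p == "_" else re.escape(p)
        for p in parts
    )
    return re.compile("^" + body + "$")

rx = like_to_regex("%Name1000%")   # like {name: {$regex: "Name1000"}}
```

Because the leading % maps to .*, there is no fixed prefix to turn into an index range, which is exactly why the unanchored form falls back to scanning every key.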

Thanks!



Why not set up a text index? Using a regex on a normal index is never going to be as fast as a text index, and text indexes also support stemming and case insensitivity.



Thank you. It is not only about indexes/performance; it is about the requirement: we would like to search for customers using a partial name. I actually tried it with a text index as well (please see the post above).

I have come to realize that MongoDB's performance is very poor with this kind of query:

db.customers.find({name:{$regex:"Name1000"}},{name:1,_id:1})

select * from customers where name like '%Name1000%'  

even though there is an index on name.

