Tuesday, December 9, 2014

Extremely large query timers in the log

Sometimes I see very large query timers in the log. This happens across different collections.

namespace                            min (ms)      max (ms)     mean (ms)    95%-ile (ms)      sum (ms)
stats.api_call                            101    1271310319         67184          1161.0    1279520230
geo.net_country                    1271310318    1271310318    1271310318    1271310318.0    1271310318
profiles.author                           101    1271310318         24721           852.0    1287434566
posts.msg_20141028                        105    1271310319      74783277     254263466.2    1271315725
duplicates.dup_20141030                   120    1271310319     181616058     889917393.7    1271312409
comments_link.revcom_20141030      1271310318    1271310318    1271310318    1271310318.0    1271310318
api_access_stats.api_call                 101    1271310317        256415          2766.0    1275155227
api_access_stats.api_call_ip              101    1271310317        455569         3055.25    1273772879

This data is based on mloginfo output. What does it mean?



You have some terribly slow queries - some that take more than 2 weeks. Find out what those queries are. They should trigger lines in the logs with the timing information used by mloginfo, so if that tool doesn't find the queries for you, you should be able to do it yourself. The most likely cause is a lack of indexes.
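One way to hunt for them, sketched in the mongo shell (assuming you can connect to each shard's mongod directly, since the profiler is per-mongod and not available through mongos; the 100 ms and 1000 ms thresholds below are just examples):

// log every operation slower than 100 ms into the system.profile collection
db.setProfilingLevel(1, 100)

// later, list the slowest profiled operations for this database
db.system.profile.find({ millis: { $gt: 1000 } }).sort({ millis: -1 }).limit(10).pretty()

// or inspect operations that are running right now and have been going for a while
db.currentOp().inprog.filter(function (op) { return op.secs_running > 10; })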



Nope. The uptime of mongod is less than 1 day.



Another example of such a query:

2014-11-01T17:20:01.687+0300 [conn2146] query geo.net_city query: { $query: { lower_name_ru: "ст профинтерн" }, $orderby: { weight: -1 } } planSummary: IXSCAN { lower_name_ru: 1 }, IXSCAN { lower_name_ru: 1 } ntoreturn:1 ntoskip:0 nscanned:0 nscannedObjects:0 keyUpdates:0 numYields:0 locks(micros) r:18446744067802 nreturned:0 reslen:20 1271310313ms



Stats info about this collection:
> db.net_city.stats()
{
    "sharded" : false,
    "primary" : "shard0002",
    "ns" : "geo.net_city",
    "count" : 91863,
    "size" : 32112072,
    "avgObjSize" : 349,
    "storageSize" : 41988096,
    "numExtents" : 8,
    "nindexes" : 9,
    "lastExtentSize" : 15290368,
    "paddingFactor" : 1,
    "systemFlags" : 1,
    "userFlags" : 0,
    "totalIndexSize" : 34780704,
    "indexSizes" : {
        "_id_" : 3000592,
        "id_1" : 2608144,
        "country_id_1" : 2918832,
        "region_id_1" : 3450272,
        "lower_name_ru_1" : 5976656,
        "lower_name_en_1" : 4504976,
        "latitude_1_longitude_1_sector_1" : 5273520,
        "weight_1" : 3703728,
        "sector_1" : 3343984
    },
    "ok" : 1
}



If the server has only been up for a week, you must be looking at the wrong log or there is a bug in the slow query metrics, because 1271310313ms is 353.142 hours. That is where Will got that the query is taking more than a week. It is also strange that nscanned is 0 for such a slow query. What version of MongoDB is this?

Let's assume that the query is slow and the numbers are just munged... I suspect that the value you have picked in the query matches a large number of documents in the database with different weights. The lower_name_ru_1 index finds those documents quickly, but then the database has to manually sort the documents it finds in order to return them. If you create an index like { lower_name_ru : 1, weight : 1 } I think that query will be much faster. To confirm there is a speedup you can try doing an explain on the query before and after the index is created.
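A minimal sketch of that in the mongo shell, using the query from the log line above (the explain() output is what confirms whether it actually helps):

// before the new index: the { lower_name_ru: 1 } plan has to fetch the matching
// documents and then sort them by weight in memory
// (in 2.6 explain() output this shows up as "scanAndOrder" : true)
db.net_city.find({ lower_name_ru: "ст профинтерн" }).sort({ weight: -1 }).limit(1).explain()

// compound index that both filters on lower_name_ru and provides the weight ordering
db.net_city.ensureIndex({ lower_name_ru: 1, weight: 1 })

// after: the planner can walk the compound index (backwards over weight for the
// descending sort), so "scanAndOrder" should be false and no in-memory sort is needed
db.net_city.find({ lower_name_ru: "ст профинтерн" }).sort({ weight: -1 }).limit(1).explain()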

For a strategy for designing your indexes based on your query patterns, I will point you to this blog post: http://emptysqua.re/blog/optimizing-mongodb-compound-indexes/



I suppose that it is not a very slow query. The MongoDB version is 2.6.4.
All such queries have very similar timers that differ by ±1 ms: 1271310317, 1271310318, 1271310319.
Another example:
namespace          pattern           count    min (ms)      max (ms)    mean (ms)    95%-ile (ms)      sum (ms)
profiles.author    {"_hash": 1}      47100         101    1271310317       27176           339.0    1280005391
This collection has a unique index on the _hash field.
The mean value is 27176 ms and the 95%-ile is 339 ms; in fact the single max value accounts for almost the entire sum, and without it the remaining 47099 queries average roughly 185 ms. These values show that this outlying max timer is very strange for queries with the same pattern against a unique index.

One more example:
namespace          pattern                 count      min (ms)      max (ms)     mean (ms)    95%-ile (ms)       sum (ms)
geo.net_country    {"lower_name_ru": 1}        1    1271310319    1271310319    1271310319    1271310319.0    1271310319


geo.net_country is a very small collection:
mongos> db.net_country.stats()
{
    "sharded" : false,
    "primary" : "shard0002",
    "ns" : "geo.net_country",
    "count" : 241,
    "size" : 45540,
    "avgObjSize" : 188,
    "storageSize" : 61440,
    "numExtents" : 2,
    "nindexes" : 5,
    "lastExtentSize" : 49152,
    "paddingFactor" : 1,
    "systemFlags" : 1,
    "userFlags" : 0,
    "totalIndexSize" : 57232,
    "indexSizes" : {
        "_id_" : 8176,
        "id_1" : 8176,
        "lower_name_ru_1" : 24528,
        "lower_name_en_1" : 8176,
        "code_1" : 8176
    },
    "ok" : 1
}

Details about the request in the log:
2014-11-05T11:33:01.953+0300 [conn5937] query geo.net_country query: { lower_name_ru: "россия волгоград" } planSummary: IXSCAN { lower_name_ru: 1 } ntoskip:0 nscanned:0 nscannedObjects:0 keyUpdates:0 numYields:0 locks(micros) r:18446744073553 nreturned:0 reslen:20 1271310319ms

An empty result is normal for this query because it is a check of user input.

My point is that the cause of these strange timers in different collections is not slow queries but something else. I want to find out that cause.



How are you keeping your server times in sync? Also, what sort of virtualization are you using?

I suspect there is some sort of time sync problem and the time is going backwards, resulting in these implausible query times. See SERVER-4740 and related issues.
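If you want to see whether a server's clock ever steps backwards, here is a crude sketch to run in the mongo shell connected to the suspect mongod (serverStatus().localTime is the server-side wall clock; the polling interval and iteration count are arbitrary, and ntpd's own logs are a more reliable source):

// poll the server's clock and report any backwards step
var prev = db.serverStatus().localTime;
for (var i = 0; i < 600; i++) {
    sleep(100);                                   // shell helper, milliseconds
    var now = db.serverStatus().localTime;
    if (now < prev) {
        print("clock stepped back by " + (prev - now) + " ms at " + now);
    }
    prev = now;
}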



All shards are physical servers without virtualization. All of them sync from a local NTP server.



We've seen queries take two weeks according to the logs. As far as I know it was *always* traced back to the clock moving backwards during some very fast queries. If the elapsed time comes out negative, then, since it's stored in an unsigned integer, it ends up represented as a value in the range of 1271310318. (Don't you think it's a funny coincidence that you have so many queries that run exactly 1271310319 or 1271310317 ms? Isn't it more likely that they took 0, 1 or 2 ms but your clock went slightly back?)
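A toy illustration of that idea in plain JavaScript (this is not mongod's actual timer code, and the exact value that ends up in the log depends on the units and width of mongod's internal counters; the point is only that a tiny negative duration turns into an absurdly large positive one):

// the duration is computed as end - start; if ntpd steps the clock back
// between the two readings, the difference comes out negative
var start   = 1414916401690;    // millisecond timestamps, for illustration only
var end     = 1414916401687;    // the clock was stepped back 3 ms
var elapsed = end - start;      // -3

// a negative value stored in an unsigned 64-bit integer wraps around to a
// huge positive number (2^64 - 3 here, approximated with doubles in the shell)
var wrapped = Math.pow(2, 64) + elapsed;
print(elapsed);                 // -3
print(wrapped);                 // about 1.8446744073709552e19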

Even if you are syncing from an NTP server, it's possible that rather than syncing continuously it is set up to sync periodically, and if the clock is running fast on the server then it periodically has to be set "back"...



Thank you, so it's not a reason for concern. I was afraid that it could be something very bad.



Well, time going backwards on a machine (even if it's only one millisecond back every day, for example) *can* be quite bad, since something might be detected as failed when it succeeded, or vice versa, not necessarily at the OS level but at the user program level. I would check the ntpd configuration to see if maybe this can be fixed easily.

