namespace                        min (ms)     max (ms)     mean (ms)    95%-ile (ms)    sum (ms)
stats.api_call                   101          1271310319   67184        1161.0          1279520230
geo.net_country                  1271310318   1271310318   1271310318   1271310318.0    1271310318
profiles.author                  101          1271310318   24721        852.0           1287434566
posts.msg_20141028               105          1271310319   74783277     254263466.2     1271315725
duplicates.dup_20141030          120          1271310319   181616058    889917393.7     1271312409
comments_link.revcom_20141030    1271310318   1271310318   1271310318   1271310318.0    1271310318
api_access_stats.api_call        101          1271310317   256415       2766.0          1275155227
api_access_stats.api_call_ip     101          1271310317   455569       3055.25         1273772879
The data above is based on mloginfo output. What does it mean?
You have some terribly slow queries - some that take more than 2 weeks. Find out what those queries are. They should trigger lines in the logs with the timing information used by mloginfo, so if that tool doesn't find the queries for you, you should be able to find them yourself. The most likely cause is a lack of indexes.
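If the slow operations are still happening, one way to catch them directly (independent of mloginfo, which only parses the log file) is the database profiler; note that on a sharded cluster it has to be enabled on each shard's mongod rather than through mongos, and the 100 ms threshold below is just an illustrative value:

// Rough sketch, run in the mongo shell against the database in question.
// Level 1 records operations slower than the given threshold (here 100 ms).
db.setProfilingLevel(1, 100)

// After some traffic, inspect the slowest captured operations:
db.system.profile.find().sort({ millis: -1 }).limit(10).pretty()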
Nope. The uptime of mongod is less than 1 day.
Another example of such a query:
2014-11-01T17:20:01.687+0300 [conn2146] query geo.net_city query: { $query: { lower_name_ru: "ст профинтерн" }, $orderby: { weight: -1 } } planSummary: IXSCAN { lower_name_ru: 1 }, IXSCAN { lower_name_ru: 1 } ntoreturn:1 ntoskip:0 nscanned:0 nscannedObjects:0 keyUpdates:0 numYields:0 locks(micros) r:18446744067802 nreturned:0 reslen:20 1271310313ms
Stats info about this collection:
> db.net_city.stats()
{
    "sharded" : false,
    "primary" : "shard0002",
    "ns" : "geo.net_city",
    "count" : 91863,
    "size" : 32112072,
    "avgObjSize" : 349,
    "storageSize" : 41988096,
    "numExtents" : 8,
    "nindexes" : 9,
    "lastExtentSize" : 15290368,
    "paddingFactor" : 1,
    "systemFlags" : 1,
    "userFlags" : 0,
    "totalIndexSize" : 34780704,
    "indexSizes" : {
        "_id_" : 3000592,
        "id_1" : 2608144,
        "country_id_1" : 2918832,
        "region_id_1" : 3450272,
        "lower_name_ru_1" : 5976656,
        "lower_name_en_1" : 4504976,
        "latitude_1_longitude_1_sector_1" : 5273520,
        "weight_1" : 3703728,
        "sector_1" : 3343984
    },
    "ok" : 1
}
If the server has only been up for a week, you must be looking at the wrong log, or there is a bug in the slow-query metrics, because 1271310313 ms is 353.142 hours. That is where Will got the idea that the query is taking more than a week. It is also strange that nscanned is 0 for such a slow query. What version of MongoDB is this?
Let's assume that the query is slow and the numbers are just munged... I suspect that the value you have picked in the query matches a large number of documents in the database with different weights. The lower_name_ru_1 index finds those documents quickly, but then the database has to sort them in memory before returning them. If you create an index like { lower_name_ru : 1, weight : 1 }, I think that query will be much faster. To confirm whether there is a speedup, you can run an explain on the query before and after the index is created, as sketched below.
For a strategy for designing your indexes based on your query patterns, I will point you to this blog post: http://emptysqua.re/blog/optimizing-mongodb-compound-indexes/
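A minimal sketch of that suggestion in the mongo shell, using the field names from the logged query above (the weight direction in the index is an assumption matching the $orderby; either direction works when the lower_name_ru part is an equality match, and the explain output is what actually confirms the plan change):

// Add the suggested compound index (2.6 shell; createIndex works too):
db.net_city.ensureIndex({ lower_name_ru: 1, weight: -1 })

// Run the same query with explain before and after creating the index and compare the plans:
db.net_city.find({ lower_name_ru: "ст профинтерн" }).sort({ weight: -1 }).limit(1).explain()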
I suppose that it is not really a very slow query. The MongoDB version is 2.6.4.
The profiles.author collection, queried in the example below, has a unique index on the _hash field.
All such queries have very similar timers that differ by only ±1 ms: 1271310317, 1271310318, 1271310319.
Another example:
namespace          pattern         count    min (ms)    max (ms)      mean (ms)    95%-ile (ms)    sum (ms)
profiles.author    {"_hash": 1}    47100    101         1271310317    27176        339.0           1280005391
The mean value is 27176 ms and the 95%-ile is 339 ms. These values show that such an outlying max timer is very strange for queries with the same pattern against a unique index.
One more example:
namespace          pattern                 count    min (ms)      max (ms)      mean (ms)     95%-ile (ms)    sum (ms)
geo.net_country    {"lower_name_ru": 1}    1        1271310319    1271310319    1271310319    1271310319.0    1271310319
geo.net_country is a very small collection:
mongos> db.net_country.stats()
{
    "sharded" : false,
    "primary" : "shard0002",
    "ns" : "geo.net_country",
    "count" : 241,
    "size" : 45540,
    "avgObjSize" : 188,
    "storageSize" : 61440,
    "numExtents" : 2,
    "nindexes" : 5,
    "lastExtentSize" : 49152,
    "paddingFactor" : 1,
    "systemFlags" : 1,
    "userFlags" : 0,
    "totalIndexSize" : 57232,
    "indexSizes" : {
        "_id_" : 8176,
        "id_1" : 8176,
        "lower_name_ru_1" : 24528,
        "lower_name_en_1" : 8176,
        "code_1" : 8176
    },
    "ok" : 1
}
Details about the request in the log:
2014-11-05T11:33:01.953+0300 [conn5937] query geo.net_country query: { lower_name_ru: "россия волгоград" } planSummary: IXSCAN { lower_name_ru: 1 } ntoskip:0 nscanned:0 nscannedObjects:0 keyUpdates:0 numYields:0 locks(micros) r:18446744073553 nreturned:0 reslen:20 1271310319ms
An empty result is normal for this query, because it is a check of user input.
My point is that the cause of these strange timers in different collections is not a slow query but something else. I want to find out what that cause is.
How are you keeping your server times in sync? Also, what sort of virtualization are you using?
I suspect there is some sort of time sync problem and the time is going backwards resulting in these implausible query times. See SERVER-4740 and related issues.
All shards are physical servers without virtualization. All of them are syncing from a local NTP server.
We've seen queries take two weeks according to the logs. As far as I know it was *always* traced back to the clock moving back during some very fast queries. If the elapsed time is measured as negative, then, since it's stored in an unsigned integer, it ends up being represented as a huge value in the range of 1271310318. Don't you think it's a funny coincidence that you have so many queries that run exactly 1271310319 or 1271310317 ms? Isn't it more likely that they took 0 or 1 or 2 ms, but your clock went slightly back?
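For illustration, a tiny sketch of that wraparound effect in plain JavaScript (the timestamps are made up, and the exact figure that ends up in the log depends on the width and units of the counter mongod uses, so this only shows the shape of the problem, not the specific 1271310319 value):

// Hypothetical clock readings in ms; the second one is earlier because the clock stepped back.
var start = 1414950000002;
var end = 1414950000000;
var elapsed = end - start;            // -2

// Reinterpreted as an unsigned 64-bit number, -2 becomes 2^64 - 2
// (floating-point arithmetic here, just to show the order of magnitude):
print(Math.pow(2, 64) + elapsed);     // ~1.8446744073709552e19 -- an absurd "duration"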
Even if you are syncing from an NTP server, it's possible that rather than syncing continuously it's set up to sync periodically, and if the clock is running fast on the server, then periodically it has to be set "back"...
Thank you, so it's not a reason for concern. I was afraid that it could be something very bad.
Well, time going backwards on a machine (even if it's only one millisecond back every day, for example) *can* be quite bad, since something might be detected as failed when it succeeded or vice versa, not necessarily at the OS level but at the user program level. I would check the ntpd configuration to see if maybe this can be fixed easily.