We've updated our MongoDB deployment from 2.6.3 to 2.6.5 and encountered some strange behavior from the DB.
After the update we had a very slow insert rate (~150/s); the DB was faulting a lot and using very little memory (~500 MB). We then thought of touching the indexes into memory, which seemed to solve the problem for a few minutes and filled the memory as needed.
But after a few minutes the issue returned, with the only difference that now mongod did use a lot of memory (~8 GB) but still had a very high fault count and a high DB lock percentage (~98%). We ran the touch command again and the issue was solved again.
Thinking it's a reference counter or some Least Recently Used logic issue, we touched the indexes into memory several times, which seems to have solved the problem for the last 30 minutes...
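For reference, this is roughly the warm-up we run from the mongo shell (a minimal sketch; the archive.lines namespace is taken from the log excerpts later in the thread, and the data/index flags are the standard options of the touch command):

// run against the database that holds the cold collection
var archive = db.getSiblingDB("archive");
// pull both the documents and the index entries of archive.lines into RAM
archive.runCommand({ touch: "lines", data: true, index: true });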
Has anyone encountered this kind of issue before?
I'll be happy to learn what caused the issue in case it happens again.
It sounds like what you might see if some query runs occasionally that causes a full collection scan of some large collection - this would pull data into RAM and possibly displace the working set indexes.
Can you check the logs and see what the first slow query is *after* the touch command? Are there any queries with planSummary: COLLSCAN and a very high nscanned? (What counts as high depends on the data set size and the size of the documents, so look for values that are relatively high.)
It's also possible that the operation causing this is indexed (planSummary: IXSCAN) but the nscanned number is very high anyway.
The logs in general will tend to have a lot of clues to what's happening here and why...
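One way to surface the same information without grepping the raw log is the query profiler - a minimal sketch, assuming profiling can be enabled on the archive database (the 100 ms slow threshold and the 100000 nscanned cut-off below are arbitrary placeholders):

var archive = db.getSiblingDB("archive");
// record operations slower than 100 ms into archive.system.profile
archive.setProfilingLevel(1, 100);
// later: list the slowest captured operations that scanned a lot of documents
archive.system.profile.find({ nscanned: { $gt: 100000 } }).sort({ millis: -1 }).limit(10).forEach(printjson);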
Thanks for the quick help Asya :)
I checked the logs but it doesn't seem like we had a lot of COLLSCAN or IXSCAN queries.
Here are the log entries from that day that did contain them:
2014-10-19T16:40:09.475+0300 [conn8] query archive.lines planSummary: COLLSCAN cursorid:78908849102 ntoreturn:0 ntoskip:0 nscanned:101 nscannedObjects:101 keyUpdates:0 numYields:20 locks(micros) r:1267 nreturned:101 reslen:81943 191ms
2014-10-19T17:08:19.284+0300 [TTLMonitor] query archive.system.indexes query: { expireAfterSeconds: { $exists: true } } planSummary: COLLSCAN ntoreturn:0 ntoskip:0 nscanned:4 nscannedObjects:4 keyUpdates:0 numYields:1 locks(micros) r:67787 nreturned:0 reslen:20 125ms
There are other queries logged, but they're almost all inserts and they all look like this:
2014-10-19T17:59:37.100+0300 [conn20] insert archive.lines query: { _id: ObjectId('5443d05d10842271188ba6e7')... data.... } ninserted:1 keyUpdates:0 numYields:0 locks(micros) w:105 422ms
2014-10-19T17:59:37.100+0300 [conn17] insert archive.lines query: { _id: ObjectId('5443d03410842258188ba91a'), ... data.... } ninserted:1 keyUpdates:0 numYields:0 locks(micros) w:51 433ms
2014-10-19T17:59:37.100+0300 [conn19] insert archive.lines query: { _id: ObjectId('5443d04910842246188ba021')... data.... } ninserted:1 keyUpdates:0 numYields:0 locks(micros) w:63 412ms
2014-10-19T17:59:37.113+0300 [conn14] insert archive.lines query: { _id: ObjectId('5443d0168f7ac305ed8ba70f'),... data.... } ninserted:1 keyUpdates:0 numYields:0 locks(micros) w:64 416ms
which I guess were logged only because the disk the system is working on is a little on the slow side...
Also, the DB now seems to be running OK and the issue never returned.
Note that all of your inserts are very fast (locks(micros) w:51 433ms means the write lock was held for 51 *micro*seconds, but the total duration was 433 milliseconds, which means it spent over 432 ms just waiting for its turn). I suspect there was something else that *was* slow - I would look for things in the logs that have high values for locks, or possibly data file flushes that took long enough to be logged...
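A minimal sketch of where to look for those other slow pieces from the mongo shell, assuming the MMAPv1 serverStatus sections available in 2.6 (the exact fields can differ by storage engine and version):

var status = db.serverStatus();
// background data file flushes: a large last_ms / average_ms points at the kind of slow-disk stall suspected above
printjson(status.backgroundFlushing);
// page faults since startup: a steadily climbing counter matches the heavy faulting reported in the question
printjson(status.extra_info);
// readers/writers currently queued behind the global lock
printjson(status.globalLock.currentQueue);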
I couldn't find the cause of the slowness, but it seems to be working fine for now.