Sunday, December 14, 2014

After updating to 2.6.5, MongoDB doesn't load collection indexes into memory

We've updated our MongoDB from 2.6.3 to 2.6.5 and encountered some strange behavior from the DB.
After the update we had a very slow insert rate (~150/s); the DB was faulting a lot and using very little memory (~500 MB). We then thought of touching the indexes into memory, which seemed to solve the problem for a few minutes and filled the memory as needed.
But after a few minutes the issue returned, with the only difference that mongod now used a lot of memory (~8 GB) but still had a very large fault count and a high DB lock percentage (~98%). We ran the touch command again and the issue was solved again.
Thinking it's a reference counter or some Least Recently Used logic issue, we touched the indexes into memory several times, which seems to have solved the problem for the last 30 minutes...
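
For reference, the touch we ran was along these lines (just a sketch: the collection name comes from the archive.lines namespace in our logs, and data: false is our choice since we only wanted the indexes warmed, not the documents):

use archive
// load only the indexes of the collection into RAM (touch command, MMAPv1 / 2.6)
db.runCommand({ touch: "lines", data: false, index: true })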

Did anyone encounter this kind of issue before?
I'll be happy to learn what caused it, in case it happens again.



It sounds like what you might see if some query runs occasionally that
causes a full collection scan of some large collection - this would
pull data into RAM and possibly displace the working set indexes.

Can you check the logs and see, *after* the touch command, what the
first slow query is? Are there any queries with planSummary: COLLSCAN
and a very high nscanned? (What counts as high depends on the data
set size and the size of the documents, so look for relatively high values.)

It's also possible that the operation that causes this is indexed
(planSummary: IXSCAN) but the nscanned number is very high anyway.

The logs in general will tend to have a lot of clues to what's
happening here and why...
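
If the profiler is acceptable on that deployment, another way to surface these (instead of eyeballing the log file) would be something roughly like this - the 100 ms and nscanned thresholds are just placeholders to adjust for your data:

// record operations slower than 100 ms (placeholder threshold)
db.setProfilingLevel(1, 100)
// later, look for recent operations that scanned an unusually large number of documents
db.system.profile.find({ ns: "archive.lines", nscanned: { $gt: 100000 } }).sort({ ts: -1 }).limit(10).pretty()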



Thanks for the quick help Asyna :)

I checked the logs, but it doesn't seem like we had a lot of COLLSCAN or IXSCAN queries.
Here are the log entries from that day that did contain them:
2014-10-19T16:40:09.475+0300 [conn8] query archive.lines planSummary: COLLSCAN cursorid:78908849102 ntoreturn:0 ntoskip:0 nscanned:101 nscannedObjects:101 keyUpdates:0 numYields:20 locks(micros) r:1267 nreturned:101 reslen:81943 191ms
2014-10-19T17:08:19.284+0300 [TTLMonitor] query archive.system.indexes query: { expireAfterSeconds: { $exists: true } } planSummary: COLLSCAN ntoreturn:0 ntoskip:0 nscanned:4 nscannedObjects:4 keyUpdates:0 numYields:1 locks(micros) r:67787 nreturned:0 reslen:20 125ms

There are other queries logged, but they're almost all inserts and they all look like this:
2014-10-19T17:59:37.100+0300 [conn20] insert archive.lines query: { _id: ObjectId('5443d05d10842271188ba6e7')... data....  } ninserted:1 keyUpdates:0 numYields:0 locks(micros) w:105 422ms
2014-10-19T17:59:37.100+0300 [conn17] insert archive.lines query: { _id: ObjectId('5443d03410842258188ba91a'), ... data....  } ninserted:1 keyUpdates:0 numYields:0 locks(micros) w:51 433ms
2014-10-19T17:59:37.100+0300 [conn19] insert archive.lines query: { _id: ObjectId('5443d04910842246188ba021')... data....  } ninserted:1 keyUpdates:0 numYields:0 locks(micros) w:63 412ms
2014-10-19T17:59:37.113+0300 [conn14] insert archive.lines query: { _id: ObjectId('5443d0168f7ac305ed8ba70f'),... data.... } ninserted:1 keyUpdates:0 numYields:0 locks(micros) w:64 416m

which I guess were logged only because the disk the system is running on is a little on the slow side...

Also, the DB now seems to be running OK and the issue hasn't returned.



Note that all of your inserts are actually very fast: locks(micros) w:51 433ms
means the write lock was held for 51 *micro*seconds but the total duration was
433 milliseconds, which means the insert spent over 432 ms just waiting for its
turn. I suspect there was something else that *was* slow - I would
look for entries in the logs that have high values for locks, or
possibly data file flushes that took long enough to be logged...
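
Two quick things you could check from the shell while it's happening (a rough sketch, not a full diagnosis): the backgroundFlushing section of serverStatus shows MMAPv1 data file flush timings, and a filtered currentOp shows operations that are queued waiting on a lock:

// average and most recent data file flush times (MMAPv1)
var bf = db.serverStatus().backgroundFlushing
printjson({ flushes: bf.flushes, average_ms: bf.average_ms, last_ms: bf.last_ms })
// operations currently waiting for a lock
printjson(db.currentOp({ waitingForLock: true }))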



I couldn't find the cause of the slowness, but it seems to work fine for now.


