Tuesday, December 9, 2014

MongoDB Log Level

I'm having trouble finding this in the documentation. What are the various log levels (also known as verbosity levels)? I've tried looking in the documentation, but maybe I'm looking in the wrong place. I know there are levels 0-5, but aside from that I haven't found a description of the different levels.

It seems that 0 is the warning level and only logs warnings, unless you set the profiling level differently, in which case it will log slow queries too.

I turned the log level up to 1 because I'm trying to track down a heartbeat issue, but it seems to be logging all commands and reads/writes, and I don't really see any more heartbeat information being logged on the primary. Also, my primary's log file rocketed up to 5GB after half a day at log level 1, which seems really unusual given that the logs are normally around 2MB.

So what do the different log levels actually stand for, and what information is tracked at each one? And how much should I expect the logs to grow if I turn them up?



Log level 0 is the default, and unless you are debugging a specific issue with your mongod/mongos you should leave it at that level. Slow queries will still be logged at level 0 once they go beyond the configured slowms threshold (default 100ms). The logs get very chatty with any increase, but especially at the higher levels.
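As a quick illustration, you can see (and tune) that threshold from the mongo shell without raising the verbosity at all. This is a minimal sketch, and the 200ms value below is just an example:

// check the current profiling level and slowms threshold
db.getProfilingStatus()          // e.g. { "was" : 0, "slowms" : 100 }
// keep the profiler off (level 0) but only log operations slower than 200ms
db.setProfilingLevel(0, 200)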

There is an open issue to document the differences between the log levels, but things need to be cleaned up a bit in the core database before that can happen in a sane way.  The Jira issue for that (if you would like to watch/vote) is DOCS-3678.

For now, unless you have a real need for increased logging, the recommendation is to leave it at the default level of 0.
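If you do need to turn it up temporarily, you can change the verbosity on a running node and put it back without a restart. A minimal sketch using the standard setParameter command:

// raise the verbosity to 1 on the running mongod
db.adminCommand({ setParameter: 1, logLevel: 1 })
// ...reproduce the issue, then drop it back to the default
db.adminCommand({ setParameter: 1, logLevel: 0 })
// confirm the current value
db.adminCommand({ getParameter: 1, logLevel: 1 })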



Yes, I've noticed the chattiness increase with an increased log level. I guess I'm just confused that there doesn't seem to be any extra heartbeat information, just more queries being logged (at least from what I've seen in my logs so far).
I would like to increase my log level because I'm trying to troubleshoot an issue where my primary server is slow to respond to my secondary and arbiter. We've done a Wireshark trace on all the traffic between them and determined it is not an infrastructure issue; the network looks perfectly healthy at the time of this slowness. So I want to see if there is any information available from the database, since the boxes are new and look fine as well. I would just change the failover timeout to something slightly longer, since we are getting so many false positives, but I don't believe that is configurable. So I'd like to increase logging to get to the bottom of the issue, but I'd prefer not to have insanely large logs if I'm not going to get the information I need from them.



When you say "slow to respond", what do you mean? Are you talking about replica set heartbeats, or some other operation?

Also, what version are you using? Those heartbeats, as of version 2.4.5 at least, don't require a lock even with authentication enabled (if you are using a version older than 2.4.5 and it is heartbeats, I would recommend upgrading).

Generally, you don't need to have debugging enabled to figure out what is causing replica set heartbeats to be slow: since they are so lightweight, if they are slow then lots of other things will be too, and there will be no lack of symptoms to choose from.
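One cheap place to look is rs.status() on each member, which already exposes per-member heartbeat timing without any extra verbosity. A rough sketch (note that the member you run it on reports no ping/heartbeat fields for itself):

// print heartbeat round-trip and last-received times for every member
rs.status().members.forEach(function(m) {
    print(m.name + " " + m.stateStr +
          " pingMs=" + m.pingMs +
          " lastHeartbeatRecv=" + m.lastHeartbeatRecv);
});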



When we looked through the trace we could see clear examples where the replica set heartbeats were, in fact, slow to respond. It would take 14-16s before the primary would send a response, even though we can see it ack the request from the secondary (or arbiter), and then we would see an election go through. At any rate, we haven't seen any latency in our application or any monitoring on the mongos, and when looking in the logs everything looks pretty normal. So it's curious for us that we don't see other symptoms, but we do see this happen every few weeks. Typically we don't have any problems, but recently there was a bug (which has been fixed, I believe) where mongo wouldn't roll back replica set transactions properly. I'm not sure which build it's been fixed in, but we are running 2.6.1 and just had it happen a couple of weeks ago, so it must be after the build we have. Aside from this heartbeat issue we haven't run into any other problems with our cluster, so it would be nice to iron this out so that we have a stable cluster again.



Well, it doesn't sound like an issue I recall from the earlier 2.6 builds, but I may just not remember the issue in question. There are >160 issues closed between 2.6.1 and 2.6.5; I'll have a look through them and see if any could account for such a slow response to a heartbeat that I might have missed. Is an upgrade to 2.6.5 something you would be willing to try? We don't have a specific smoking gun, but before we look at trying to investigate a new potential bug it is always advisable to be on the latest point release to make sure the issue has not already been fixed.

It would certainly be interesting to get a look at the traces showing the slow responses also, though I understand that might not be possible.



I can post the traces when I get back in the office on Monday if that will help.
And sorry for not being clear. The bug was that the primary that stepped down because of slow heartbeats wasn't able to roll back the transactions, and we ended up having to manually sync because it just shut down.
I would be willing to try upgrading to 2.6.5; it would just require a little testing beforehand since we do have production systems running on it.



Ok, I'm attaching the traces here. There are two traces from the original primary and one from the original secondary server. You'll see that the blue highlighted part shows where the primary responds to the secondary, but it takes about 16 seconds to do so. And then with the trace to the arbiter you'll see the highlighted part where it finally replied to the arbiter after 35 seconds, by which point the arbiter had already closed the TCP connection. The red highlights are where the original secondary and the arbiter reset the connection with the original primary.

These are the logs from the original primary at the time, which I don't feel show any reason for it to be slow.


2014-11-20T16:59:06.575-0600 [conn60120] command admin.$cmd command: serverStatus { serverStatus: 1 } ntoreturn:1 keyUpdates:0 numYields:0 locks(micros) r:39 reslen:3656 866ms
2014-11-20T16:59:07.579-0600 [conn60120] end connection xx.x.xxx.xxx:xxxxx (173 connections now open)
2014-11-20T16:59:08.528-0600 [conn60124] end connection xx.x.xxx.xxx:xxxxx (172 connections now open)

2014-11-20T16:59:29.613-0600 [clientcursormon] mem (MB) res:236057 virt:3745418
2014-11-20T16:59:29.613-0600 [clientcursormon]  mapped (incl journal view):3737954
2014-11-20T16:59:29.613-0600 [clientcursormon]  connections:172
2014-11-20T16:59:29.613-0600 [clientcursormon]  replication threads:32
2014-11-20T16:59:30.754-0600 [conn57521] end connection xx.x.xxx.xxx:xxxxx (171 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60126 (172 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60127 (173 connections now open)
2014-11-20T17:00:00.433-0600 [conn60125] end connection xx.x.xxx.xxx:xxxxx (172 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60128 (174 connections now open)
2014-11-20T17:00:00.433-0600 [conn60126] end connection xx.x.xxx.xxx:xxxxx (172 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60129 (174 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60130 (174 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60131 (175 connections now open)
2014-11-20T17:00:00.433-0600 [conn60127] end connection xx.x.xxx.xxx:xxxxx (174 connections now open)
2014-11-20T17:00:00.433-0600 [conn60128] end connection xx.x.xxx.xxx:xxxxx (174 connections now open)
2014-11-20T17:00:00.434-0600 [conn60130] end connection xx.x.xxx.xxx:xxxxx (174 connections now open)
2014-11-20T17:00:00.433-0600 [conn60129] end connection xx.x.xxx.xxx:xxxxx (174 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60132 (176 connections now open)
2014-11-20T17:00:00.434-0600 [conn60131] end connection xx.x.xxx.xxx:xxxxx (171 connections now open)
2014-11-20T17:00:00.434-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60133 (173 connections now open)
2014-11-20T17:00:00.434-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60134 (173 connections now open)
2014-11-20T17:00:00.434-0600 [conn60132] end connection xx.x.xxx.xxx:xxxxx (172 connections now open)
2014-11-20T17:00:00.434-0600 [conn60134]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T17:00:00.435-0600 [rsHealthPoll] replset info ARBITER:27017 thinks that we are down
2014-11-20T17:00:00.477-0600 [rsHealthPoll] replSet member MONGOSERVER2:27017 is now in state PRIMARY
2014-11-20T17:00:00.477-0600 [rsMgr] stepping down; another primary was elected more recently
2014-11-20T17:00:00.478-0600 [rsMgr] replSet relinquishing primary state
2014-11-20T17:00:00.478-0600 [rsMgr] replSet SECONDARY
2014-11-20T17:00:00.478-0600 [rsMgr] replSet closing client sockets after relinquishing primary

As I said before, I'm not unwilling to try 2.6.5, but I just don't know if it will resolve the issues that I've been seeing, and since I have a production system up and running currently, I'd need to test before I can make the upgrade.






Also, along with the logs from the original primary server, these are the logs from the secondary server and the arbiter.

original secondary (MONGOSERVER2) logs
2014-11-20T16:59:01.735-0600 [conn59639]  authenticate db: admin { authenticate: 1, nonce: "xxx", user: "svcClusterAdmin", key: "xxx" }
2014-11-20T16:59:02.000-0600 [repl index builder 30]            Index Build(background): 36200500/124435801     29%
2014-11-20T16:59:05.003-0600 [repl index builder 30]            Index Build(background): 36577700/124435801     29%
2014-11-20T16:59:08.000-0600 [repl index builder 30]            Index Build(background): 36958700/124435801     29%
2014-11-20T16:59:08.528-0600 [conn59637] end connection 10.0.240.250:59921 (70 connections now open)
2014-11-20T16:59:11.000-0600 [repl index builder 30]            Index Build(background): 37338400/124435801     30%
2014-11-20T16:59:14.000-0600 [repl index builder 30]            Index Build(background): 37697700/124435801     30%
2014-11-20T16:59:17.010-0600 [repl index builder 30]            Index Build(background): 38079300/124435801     30%
2014-11-20T16:59:18.527-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:18.547-0600 [rsHealthPoll] can't authenticate to CHILXPROD061:27017 (10.0.240.250) failed as internal user, error: DBClientBase::findN: transport error: CHILXPROD061:27017 ns: local.$cmd query: { getnonce: 1 }
2014-11-20T16:59:20.000-0600 [repl index builder 30]            Index Build(background): 38458700/124435801     30%
2014-11-20T16:59:23.000-0600 [repl index builder 30]            Index Build(background): 38835100/124435801     31%
2014-11-20T16:59:26.000-0600 [repl index builder 30]            Index Build(background): 39211100/124435801     31%
2014-11-20T16:59:26.645-0600 [conn59638] end connection xxx.xx.xxx.xxx:xxxxx (69 connections now open)
2014-11-20T16:59:26.645-0600 [initandlisten] connection accepted from xxx.xx.xxx.xxx:xxxxx #59640 (70 connections now open)
2014-11-20T16:59:26.678-0600 [conn59640]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:28.549-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:28.556-0600 [rsHealthPoll] replSet info MONGOSERVER1:27017 is down (or slow to respond):
2014-11-20T16:59:28.556-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is now in state DOWN
2014-11-20T16:59:28.556-0600 [rsMgr] replSet info electSelf 1
2014-11-20T16:59:29.000-0600 [repl index builder 30]            Index Build(background): 39581200/124435801     31%
2014-11-20T16:59:30.766-0600 [rsMgr] replSet PRIMARY
2014-11-20T16:59:32.000-0600 [repl index builder 30]            Index Build(background): 39965000/124435801     32%
2014-11-20T16:59:33.972-0600 [conn59639] command admin.$cmd command: serverStatus { serverStatus: 1 } ntoreturn:1 keyUpdates:0 numYields:0 locks(micros) r:49 reslen:3608 154ms
2014-11-20T16:59:35.000-0600 [repl index builder 30]            Index Build(background): 40289100/124435801     32%
2014-11-20T16:59:38.000-0600 [repl index builder 30]            Index Build(background): 40679700/124435801     32%
2014-11-20T16:59:38.735-0600 [journal] old journal file will be removed: /opt/mongodata/journal/j._87
2014-11-20T16:59:40.597-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:41.000-0600 [repl index builder 30]            Index Build(background): 41064300/124435801     33%
2014-11-20T16:59:44.000-0600 [repl index builder 30]            Index Build(background): 41449500/124435801     33%
2014-11-20T16:59:47.000-0600 [repl index builder 30]            Index Build(background): 41858100/124435801     33%
2014-11-20T16:59:50.000-0600 [repl index builder 30]            Index Build(background): 42233500/124435801     33%
2014-11-20T16:59:52.598-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:53.000-0600 [repl index builder 30]            Index Build(background): 42613000/124435801     34%
2014-11-20T16:59:56.000-0600 [repl index builder 30]            Index Build(background): 42954300/124435801     34%
2014-11-20T16:59:56.685-0600 [conn59640] end connection xxx.xx.xxx.xxx:xxxxx (69 connections now open)
2014-11-20T16:59:56.686-0600 [initandlisten] connection accepted from 10.0.104.189:50816 #59641 (70 connections now open)
2014-11-20T16:59:56.726-0600 [conn59641]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:59.000-0600 [repl index builder 30]            Index Build(background): 43317900/124435801     34%
2014-11-20T17:00:00.435-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is up
2014-11-20T17:00:00.435-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is now in state PRIMARY
2014-11-20T17:00:00.435-0600 [rsMgr] another PRIMARY detected but it should step down since it was elected earlier than me
2014-11-20T17:00:00.435-0600 [rsMgr] another PRIMARY detected but it should step down since it was elected earlier than me

arbiter logs
2014-11-20T16:58:50.711-0600 [conn41035]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:13.954-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:13.954-0600 [rsHealthPoll] can't authenticate to CHILXPROD061:27017 (10.0.240.250) failed as internal user, error: DBClientBase::findN: transport error: CHILXPROD061:27017 ns: local.$cmd query: { getnonce: 1 }
2014-11-20T16:59:20.741-0600 [conn41036]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:23.955-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:23.955-0600 [rsHealthPoll] replSet info MONGOSERVER1:27017 is down (or slow to respond):
2014-11-20T16:59:23.955-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is now in state DOWN
2014-11-20T16:59:28.556-0600 [conn41036] replSet info voting yea for CHILXPROD062:27017 (1)
2014-11-20T16:59:32.680-0600 [rsHealthPoll] replSet member MONGOSERVER2:27017 is now in state PRIMARY
2014-11-20T16:59:35.956-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:47.958-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:50.790-0600 [conn41037]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:59.959-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T17:00:00.434-0600 [conn41038]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T17:00:01.962-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is up
2014-11-20T17:00:01.962-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is now in state SECONDARY


Any additional thoughts as to why this could be happening would be helpful, since it is clearly unrelated to the network and seems unlikely to be the actual hardware. It would be nice to have some idea of why MongoDB pauses for 16 seconds or so with no response back to the other nodes.



From the secondary and arbiter we can see they couldn't connect to the primary:

2014-11-20T16:59:13.954-0600 [rsHealthPoll] can't authenticate to CHILXPROD061:27017 (10.0.240.250) failed as internal user, error: DBClientBase::findN: transport error: CHILXPROD061:27017 ns: local.$cmd query: { getnonce: 1 }
2014-11-20T16:59:18.547-0600 [rsHealthPoll] can't authenticate to CHILXPROD061:27017 (10.0.240.250) failed as internal user, error: DBClientBase::findN: transport error: CHILXPROD061:27017 ns: local.$cmd query: { getnonce: 1 }
2014-11-20T16:59:40.597-0600 [rsHealthPoll] DBClientCursor::init call() failed
and repeated failures until:
2014-11-20T17:00:01.962-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is up
2014-11-20T17:00:00.435-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is up

From the primary, we simply don't see the attempts to connect:
2014-11-20T16:59:07.579-0600 [conn60120] end connection xx.x.xxx.xxx:xxxxx (173 connections now open)
2014-11-20T16:59:08.528-0600 [conn60124] end connection xx.x.xxx.xxx:xxxxx (172 connections now open)
2014-11-20T16:59:29.613-0600 [clientcursormon] mem (MB) res:236057 virt:3745418
2014-11-20T16:59:29.613-0600 [clientcursormon]  mapped (incl journal view):3737954
2014-11-20T16:59:29.613-0600 [clientcursormon]  connections:172
2014-11-20T16:59:29.613-0600 [clientcursormon]  replication threads:32
2014-11-20T16:59:30.754-0600 [conn57521] end connection xx.x.xxx.xxx:xxxxx (171 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60126 (172 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60127 (173 connections now open)


There isn't really any way to tell what was happening at the OS level - is it unusual for your primary to log *nothing* for an entire 30 seconds? If so, then it's possible that the mongod process was de-prioritized by the OS for some reason and basically couldn't do anything while swapped out...

Is this the complete log from this time? The primary's rsHealthPoll should have been running and connecting to the other members of the replica set, and we don't see any evidence of that either. It very strongly suggests that something was happening on the primary's box that was blocking everything from progressing normally...
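If you suspect swapping or resource starvation, one low-impact check from the database side is whether the page fault counter jumps across one of these stalls. This is just a sketch (extra_info is only populated on Linux builds):

// take two serverStatus samples a few seconds apart and compare page faults
var s1 = db.serverStatus().extra_info.page_faults;
sleep(5000);
var s2 = db.serverStatus().extra_info.page_faults;
print("page faults in the last 5s: " + (s2 - s1));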



Thanks for responding. Yes, that's exactly what I thought was strange: the primary didn't log anything for a whole 30 seconds, not the connection attempts, nothing. And yes, that is the entire log for that time period; I assumed you don't really need to see too much before/after, when it is behaving normally again.

It is clear that the secondary and arbiter weren't able to communicate with the primary, but that's why I attached the network traces, since the network is what I initially thought could be the problem, and they were completely clean. It looks like the primary just wasn't sending a response even though the connection was healthy (if you look a couple of responses back you'll see those attachments).

Also, the reason I started this thread was that I wanted to increase the logging level to see if there would be more useful information in the logs. But when I did that it ended up making my logs insanely large: on a typical day you would have 2-5MB of logs, and that went up to over 5GB just by changing the log level from 0 to 1 on the primary. To be clear, I upped the logging *after* this occurred, so there was nothing useful; it just seemed to log every query on the server, and I eventually had to turn it back down because the drive I keep the logs on simply doesn't have that kind of space. But I'm just trying to isolate where the problem could be so I can work on it.
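(For what it's worth, the log can also be rotated on demand so a temporary verbosity bump doesn't fill the drive. Just a sketch of the idea using the standard logRotate command, not something I've actually put in place:

// ask the running mongod to close the current log file and start a new one
db.adminCommand({ logRotate: 1 })
)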

In your response it seems like the problem could potentially be swapping or some other OS process, but that seems pretty unlikely. Currently the only thing running on the box is MongoDB, so there isn't much competition for resources. The boxes have 32 cores, a 1TB Cachecade drive, 20TB of SATA drives, 256GB RAM and 4GB swap, none of which I've observed to be under any strain, especially in the middle of the day when these communication lapses have been happening.

Is there any reason that MongoDB would fail to write to its log file? And is there any way to get logging focused on replica set health instead of logging all the queries? I'm less interested in query data and more interested in the general health of the cluster/servers, so I can prevent this issue from occurring.



Unfortunately, in the current version all of the logging goes up when you raise the verbosity level. Starting with 2.8 we have separate verbosity controls for the different subcomponents, and that is exactly what you would want here.
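As a rough illustration of what that looks like (based on the 2.8 release candidates, so treat the exact syntax as provisional), you can raise just the replication component and leave everything else at the default:

// bump only replication-related logging
db.adminCommand({
    setParameter: 1,
    logComponentVerbosity: { replication: { verbosity: 2 } }
})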

It sounds like you're saying this happens periodically (where the primary logs nothing for some non-trivial amount of time), including not responding to or sending any heartbeats during that time?

I'm wondering if that's coinciding with flushing data files...   Is this system in MMS by any chance?
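If MMS isn't an option, serverStatus already tracks background flush timing for the data files, so you can at least check whether a flush lines up with the stall. A quick sketch; an unusually large last_ms around that window would be suggestive:

// background flush statistics from the running mongod
var bf = db.serverStatus().backgroundFlushing;
printjson({ flushes: bf.flushes, average_ms: bf.average_ms, last_ms: bf.last_ms, last_finished: bf.last_finished });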



Oh sweet, so 2.8 will have more control over the logs; I'll really be looking forward to that.

Yes, this does happen periodically, where the primary logs nothing for some non-trivial amount of time, including not responding to heartbeats. We really only noticed this behavior because there's a bug in 2.6.1 where the primary wasn't able to roll back some writes when it stepped down after this period of non-communication with the secondary and arbiter, so I had to manually resync everything. I believe the rollback issue is fixed in 2.6.5, and at some point we want to upgrade, but if the logging features are in 2.8 we may want to wait until that is out.

This system isn't on MMS currently because the server is not externally facing and is therefore unable to connect to MMS. But I have set up some monitoring so I can check CPU, memory, etc. and keep an eye out that things appear to be relatively normal for the most part. I'm also running mongostat in the background (so far our system has been handling it without any impact on performance) and sending that info to logstash and kibana to get some of the same information MMS provides. However, for this time period I can't tell if anything was happening, because I don't see any information logged between 16:59:07 and 17:00:01 on the original primary server. If I look at the secondary, however, it's fine and still reporting information up until it becomes the primary at 16:59:30. It would be nice to know if this was happening while flushing data to files; unfortunately we have a gap of information in our system at that time, basically for anything trying to access MongoDB, for just under 30s. The systems we have on it are production systems, but they aren't so time-sensitive that we would notice a 30s lag, which is potentially why we didn't catch it until we hit that rollback bug.

Is there any way other than MMS and mongostat to get the monitoring info exposed?
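The closest thing I've come up with so far is polling serverStatus myself from the mongo shell and feeding that into logstash alongside the mongostat output. Just a sketch of the idea, with placeholder file names (monitor.js, health.log), not something that's actually in place today:

// sample a few health indicators every 10 seconds
// run with: mongo --quiet monitor.js >> health.log
while (true) {
    var s = db.serverStatus();
    print(new Date().toISOString() + " conns=" + s.connections.current +
          " pageFaults=" + s.extra_info.page_faults +
          " flushAvgMs=" + s.backgroundFlushing.average_ms);
    sleep(10000);
}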
