Thursday, December 4, 2014

MongoDB Primary crashes after Fatal Assertion 16360

I have a 3 node replica set and I'm running MongoDB 2.6.1

My primary crashed this afternoon and won't come up because of this error.

2014-11-12T15:39:09.836-0600 [repl writer worker 1] ERROR: writer worker caught exception: :: caused by :: 11000 insertDocument :: caused by :: 11000 E11000 duplicate key error index: somedatabase.colname.$field_1_morefield_1  dup key: { : "NAME/141119", : "YOLO" } on: { ts: Timestamp 1415827255000|1, h: -7641148999367123021, v: 2, op: "i", ns: "somedatabase.colname", o: { _id: ObjectId('5463cf37e7c5240e5f096dea'), morefield: "YOLO", field: "NAME/141119", server: [ "SERVERFORME038" ] } }
2014-11-12T15:39:09.836-0600 [repl writer worker 1] Fatal Assertion 16360
2014-11-12T15:39:09.836-0600 [repl writer worker 1]

***aborting after fassert() failure

I didn't think an index key error would cause a primary to crash and not be able to come up. How can I fix this and how do I keep as much of my data as possible?



Did your secondaries elect a new primary and keep the set online?

I believe the crashed node, now joining as a secondary, has some incompatible data with the rest of your replica set and it crashes while replicating that. Not sure how though. If the rest of your set is running, you might wanna consider resyncing your old primary.
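
For reference, a resync on 2.6 basically means wiping the member's data files and letting it do a full initial sync from the current primary. A rough sketch, assuming a Linux install with the data in /opt/mongodata and an init-managed service named mongod (adjust paths, the service name and the data-directory owner to your setup):

# Stop the crashed member and move its data files out of the way
sudo service mongod stop
sudo mv /opt/mongodata /opt/mongodata.bak
sudo mkdir /opt/mongodata && sudo chown mongod:mongod /opt/mongodata

# Restart it; on startup it performs an initial sync from the current primary
sudo service mongod start

# Watch it go STARTUP2 -> RECOVERING -> SECONDARY (add -u/-p if auth is enabled)
mongo --eval 'rs.status().members.forEach(function(m){ print(m.name, m.stateStr); })'

Keep the moved-aside copy around until the resync finishes and you've confirmed nothing you care about is missing.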



Yes, they elected a new primary, but it won't let me bring up the crashed node as a secondary. Any time I start that node with the data left alone, it just crashes again. It only comes up if I get rid of all the data files for the database that was throwing the error, which seems like a terrible solution and one I should only use if there are no other options, because then I lose all the data from that node that didn't get to sync.



I should add that in my replica set, I have a primary, secondary and arbiter. So only 2 nodes have data.



I'm not sure what you are referring to as data that didn't get to sync. The error is about a record that is a duplicate, which means it doesn't belong - i.e. it shouldn't exist on this node.

What was the original cause of the crash of this machine when it was primary? How was it configured - all default options?



This was the original cause of the crash when it was the primary node.



The config on the servers is from a conf file which contains:

# database
dbpath = /opt/mongodata
fork = true
# logfile
logpath = /var/log/mongodb/mongod.log
logappend = true
# net
rest = true
httpinterface = true
# Apply special variable options if they exist
# replication
replSet = ats-Production
# authentication
keyFile = /etc/mongodb.key
# authorization
auth = true

I looked further back in the logs to investigate this as well, and what you are saying makes sense: the server that crashed was *not* the primary but had become a secondary just before the crash.

2014-11-12T15:14:36.957-0600 [rsHealthPoll] replset info MONGOSERVER2:27017 thinks that we are down
2014-11-12T15:14:36.957-0600 [rsHealthPoll] replSet member MONGOSERVER2:27017 is now in state PRIMARY
2014-11-12T15:14:36.958-0600 [rsMgr] stepping down; another primary was elected more recently
2014-11-12T15:14:36.964-0600 [rsMgr] replSet relinquishing primary state
2014-11-12T15:14:36.964-0600 [rsMgr] replSet SECONDARY
2014-11-12T15:14:36.964-0600 [rsMgr] replSet closing client sockets after relinquishing primary

Then there are a bunch of connections being closed, which I won't paste here, and then I find what look like issues syncing up to the new primary:

2014-11-12T15:14:38.235-0600 [rsBackgroundSync] replSet our last op time fetched: Nov 12 15:14:36:9
2014-11-12T15:14:38.235-0600 [rsBackgroundSync] replset source's GTE: Nov 12 15:14:38:1
2014-11-12T15:14:38.256-0600 [rsBackgroundSync] replSet rollback 0
2014-11-12T15:14:38.256-0600 [rsBackgroundSync] replSet ROLLBACK
2014-11-12T15:14:38.256-0600 [rsBackgroundSync] replSet rollback 1
2014-11-12T15:14:38.256-0600 [rsBackgroundSync] replSet rollback 2 FindCommonPoint
2014-11-12T15:14:38.265-0600 [rsBackgroundSync] replSet info rollback our last optime:   Nov 12 15:14:36:9
2014-11-12T15:14:38.265-0600 [rsBackgroundSync] replSet info rollback their last optime: Nov 12 15:14:38:2
2014-11-12T15:14:38.265-0600 [rsBackgroundSync] replSet info rollback diff in end of log times: -2 seconds
2014-11-12T15:14:38.270-0600 [rsBackgroundSync] replSet rollback found matching events at Nov 12 15:14:24:9
2014-11-12T15:14:38.270-0600 [rsBackgroundSync] replSet rollback findcommonpoint scanned : 146
2014-11-12T15:14:38.270-0600 [rsBackgroundSync] replSet replSet rollback 3 fixup
2014-11-12T15:14:38.289-0600 [rsBackgroundSync] replSet rollback 3.5
2014-11-12T15:14:38.289-0600 [rsBackgroundSync] replSet rollback 4 n:77
2014-11-12T15:14:38.289-0600 [rsBackgroundSync] replSet minvalid=Nov 12 15:14:38 5463cdbe:2
2014-11-12T15:14:38.289-0600 [rsBackgroundSync] replSet rollback 4.6
2014-11-12T15:14:38.289-0600 [rsBackgroundSync] replSet rollback 4.7
2014-11-12T15:14:38.342-0600 [rsBackgroundSync] replSet exception in rollback ns: somedatabase.colname { _id: ObjectId('53fdcaa3051819f15fea2025') } 11000 E11000 duplicate key error index: somedatabase.colname.$field_1_morefield_1  dup key: { : null, : null } ndeletes:0
2014-11-12T15:14:38.347-0600 [rsBackgroundSync] replSet rollback 5 d:44 u:55
2014-11-12T15:14:38.351-0600 [rsBackgroundSync] replSet rollback 6
2014-11-12T15:14:38.355-0600 [rsBackgroundSync] replSet rollback 7
2014-11-12T15:14:38.359-0600 [rsBackgroundSync] replSet issues during syncRollback, see log
2014-11-12T15:14:38.359-0600 [rsBackgroundSync] replSet RECOVERING
2014-11-12T15:14:38.359-0600 [rsBackgroundSync] replSet syncing to: MONGOSERVER2:27017
2014-11-12T15:14:38.360-0600 [rsBackgroundSync] replset setting syncSourceFeedback to MONGOSERVER2:27017

So we see that it had issues with that index before. Then we get this error, which causes the server to crash completely.

2014-11-12T15:20:55.480-0600 [repl writer worker 1] ERROR: writer worker caught exception: :: caused by :: 11000 insertDocument :: caused by :: 11000 E11000 duplicate key error index: somedatabase.colname.$field_1_morefield_1  dup key: { : "NAME/141119", : "YOLO" } on: { ts: Timestamp 1415827255000|1, h: -7641148999367123021, v: 2, op: "i", ns: "somedatabase.colname", o: { _id: ObjectId('5463cf37e7c5240e5f096dea'), exchange: "YOLO", helixSymbol: "NAME/141119", orders: [ "SERVERFORME38" ] } }
2014-11-12T15:20:55.480-0600 [repl writer worker 1] Fatal Assertion 16360
2014-11-12T15:20:55.480-0600 [repl writer worker 1]

***aborting after fassert() failure

Following this, we have been looking into our "new" primary, which caused this secondary crash, and it looks like there is incomplete data in there. So I'm curious:
1) Why did this server crash over a duplicate key error on an index? I'm more concerned because the only way to get my server up was to move the data files for the database that was causing the crash out of the data directory.
2) Is there a way to verify that the data is being replicated successfully and correctly, so we don't end up with documents that seem erroneously created? Currently I am taking the data files from the crashed server and using them on a standalone mongo instance, without replication or authentication, to see what data I have and what is missing - but is there anything else we can do?
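
In the meantime, one check I can run (assuming both the new primary and the standalone copy of the old files are reachable; hostnames, ports and the admin user below are placeholders) is to compare per-collection dbHash output and document counts between the two:

# Per-collection md5 hashes of the database on the new primary...
mongo --host MONGOSERVER2:27017 -u admin -p --authenticationDatabase admin --eval 'printjson(db.runCommand({dbHash: 1}))' somedatabase
# ...and on the standalone instance started from the crashed node's data files
mongo --host localhost:27018 --eval 'printjson(db.runCommand({dbHash: 1}))' somedatabase

# Spot-check counts on the collection that raised the duplicate key errors
mongo --host MONGOSERVER2:27017 -u admin -p --authenticationDatabase admin --eval 'print(db.colname.count())' somedatabase
mongo --host localhost:27018 --eval 'print(db.colname.count())' somedatabase

Differing hashes at least tell me which collections to diff more carefully.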



Good piece of investigation! So the machine was a primary, lost connection to other members (or for whatever reason they couldn't see it), so they elected a new primary. When this one saw it, it had to step down. Now it must roll back the writes it has which didn't propagate to the other members - that's how it keeps its data in sync with all other members.
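
A side note on rollback mechanics: when a rollback completes, the rolled-back documents are not simply thrown away - they get dumped as BSON files into a rollback directory under the dbpath, where they can be inspected and re-applied if wanted. A rough sketch, assuming the dbpath from your conf file (the file name shown is just an example of the <db>.<collection>.<timestamp>.bson naming):

# Rolled-back documents end up under <dbpath>/rollback
ls /opt/mongodata/rollback/

# Inspect one of the dumps
bsondump /opt/mongodata/rollback/somedatabase.colname.2014-11-12T21-14-38.0.bson | head

# Re-apply documents that should be kept to the current primary
mongorestore --host MONGOSERVER2:27017 -u admin -p --authenticationDatabase admin --db somedatabase --collection colname /opt/mongodata/rollback/somedatabase.colname.2014-11-12T21-14-38.0.bson

In your case the rollback hit an exception partway through, so it's worth checking whether anything was actually written there.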

There was an error during attempted rollback, possibly because of SERVER-9870 - we think it might be fixed now, but maybe wasn't in your build.



Ah, so there is a bug - I thought something seemed fishy, since the rollback didn't seem to complete and it just crashed. Do I need to upgrade my cluster to avoid this in the future? (Ideally the cluster shouldn't elect a new primary when the server didn't actually lose connection, but that's something I need to look into on the network side separately - obviously not a Mongo problem.) I'd like to make sure that I can protect us as much as possible.



The reason that ticket hasn't been marked fixed is that we don't have good enough tests for this case (which would prove the issue is fixed, or, if they failed, would prove it's still there). But the code to fix this is in there - though I'm not certain what version it's in.



That makes sense, thanks for the help Asya.

I have one more question. I've been trying to figure out why the server wasn't responding to connections, and I haven't found anything on my network side. But on both the secondary and the arbiter I find this bit:
--Secondary
2014-11-12T15:14:14.532-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-12T15:14:14.542-0600 [rsHealthPoll] can't authenticate to MONGOSERVER1:27017 (10.0.240.250) failed as internal user, error: DBClientBase::findN: transport error: MONGOSERVER1:27017 ns: local.$cmd query: { getnonce: 1 }

--Arbiter
2014-11-12T15:14:22.484-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-12T15:14:22.485-0600 [rsHealthPoll] can't authenticate to MONGOSERVER1:27017 (10.0.240.250) failed as internal user, error: DBClientBase::findN: transport error: MONGOSERVER1:27017 ns: local.$cmd query: { getnonce: 1 }
2014-11-12T15:14:24.544-0600 [conn529946] MONGOSERVER2:27017 is trying to elect itself but MONGOSERVER1:27017 is already primary
2014-11-12T15:14:26.983-0600 [conn529947]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-12T15:14:30.984-0600 [conn529947] MONGOSERVER2:27017 is trying to elect itself but MONGOSERVER1:27017 is already primary
2014-11-12T15:14:32.485-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-12T15:14:32.486-0600 [rsHealthPoll] replSet info MONGOSERVER1:27017 is down (or slow to respond):
2014-11-12T15:14:32.486-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is now in state DOWN
2014-11-12T15:14:36.545-0600 [conn529947] replSet info voting yea for MONGOSERVER2:27017 (1)
2014-11-12T15:14:36.941-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is up
2014-11-12T15:14:36.941-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is now in state SECONDARY

So, from my analysis of the logs, there wasn't anything unusual going on on the primary when the secondary and the arbiter were trying to connect: incoming writes were being handled just fine, and I don't see anything in its logs to indicate that either of them was even trying to connect.

From the arbiter and secondary logs it seems like they can't authenticate, but nothing has changed in the setup, and the primary server didn't have an extraordinary number of connections at the time: 178 just before it stepped down. Also, it looks like the arbiter was able to connect to it immediately after the secondary was elected primary.

Is there anything else that could be causing this that I may need to configure on the server side? 

I found some references to file limits while googling the client connection issue, so I checked, and this is what I see - it seems like we should be well under all the limits. It would be nice to know why the server suddenly stopped accepting connections, especially since it doesn't look like there was a networking issue, given that the application could still connect.

$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2066228
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 32768
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
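
One thing I should probably double-check: ulimit -a in my interactive shell shows that shell's limits, not necessarily the limits the running mongod inherited (the init script can set its own). A rough sketch for checking the live process and the connection counters mongod itself reports (the admin user is a placeholder); connection threads also count against "max user processes", though 178 connections is well under 1024:

# Limits actually applied to the running mongod process
cat /proc/$(pidof mongod)/limits

# Current vs. available connections as mongod sees them
mongo MONGOSERVER1:27017/admin -u admin -p --authenticationDatabase admin --eval 'printjson(db.serverStatus().connections)'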



Hard to say for sure, but one possibility is a partial network issue affecting primary-to-secondary/arbiter traffic but not application-to-primary traffic. Another possibility is the primary becoming unresponsive due to some local/OS/db issue, but that seems less likely since there are 10 seconds between the secondary's error against the primary and the arbiter's error (and heartbeats go out every two seconds).
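
If it happens again, a couple of quick checks run from the secondary and the arbiter toward the primary at the time of the errors might help tell the two cases apart (just a sketch; hostnames as in your logs):

# Basic reachability of the primary host
ping -c 5 MONGOSERVER1

# Does the mongod port itself answer? (the ping command needs no authentication)
mongo --host MONGOSERVER1:27017 --eval 'printjson(db.adminCommand({ping: 1}))'

If the host pings but the mongod port doesn't answer, that points at the process rather than the network.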



OK, that helps. I'll focus my efforts on potential networking issues instead of spending time on the OS/local hardware.

