July 22, 2019
Recently I had to add some new nodes to our MongoDB sharded cluster running in AWS, replacing the rather expensive i2 nodes that we could finally retire after their reservation had expired.
The process started as usual, and I expected it to take a few hours. However, I was surprised to find it still running the next day when I returned to the office; the replication was stuck in the STARTUP2 phase.
Taking a closer look at the mongodb.log files, I spotted an error message reporting a negative document count of -1 for one of our collections.
2018-12-19T03:35:26.279+0000 W REPL [replication-85] collection clone for '<DATABASE_NAME>.<COLLECTION_NAME>' failed due to BadValue: While cloning collection '<DATABASE_NAME>.<COLLECTION_NAME>' there was an error 'Count call on collection <DATABASE_NAME>.<COLLECTION_NAME> from <SYNC_SOURCE> returned negative document count: -1'
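If you want to check whether your own logs contain this warning, a small scan along these lines might help. It is only a sketch: the function name `find_negative_count_errors` is made up for this example, and the regular expression is derived from the single log line above, so other MongoDB versions may word the message differently.

```python
import re

# Pattern derived from the one warning shown above (assumption: the wording
# is the same on your MongoDB version).
NEGATIVE_COUNT = re.compile(
    r"Count call on collection (?P<namespace>\S+) from (?P<source>\S+) "
    r"returned negative document count: (?P<count>-\d+)"
)


def find_negative_count_errors(lines):
    """Yield (namespace, sync_source, count) for each matching log line."""
    for line in lines:
        match = NEGATIVE_COUNT.search(line)
        if match:
            yield match["namespace"], match["source"], int(match["count"])
```

The function accepts any iterable of strings, so an open log file can be passed in directly and is read line by line.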
Further investigation led me to MongoDB issue SERVER-35050, which matches this problem exactly. I am still not sure how our collection could ever end up in a state where it returns a negative document count, but I followed the fix recommended by Bruce and called validate() on the collection. Meanwhile, the replication process continued with attempt 3 of 10, and the next day the cluster had proper secondaries again.
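For reference, this is roughly what the validate call could look like from Python. Treat it as a sketch under assumptions, not as the exact commands I ran: it assumes the pymongo driver, a direct connection to the sync source named in the log, and placeholder host, database, and collection names. The helper name `validate_collection` is also just for illustration.

```python
def validate_collection(db, collection_name):
    """Run the validate command and return (valid, nrecords).

    `db` is expected to behave like a pymongo Database, i.e. to expose
    command(name, value) and return the server reply as a dict.
    """
    reply = db.command("validate", collection_name)
    return reply.get("valid"), reply.get("nrecords")


def main():
    # Imported here so the helper above works without pymongo installed.
    from pymongo import MongoClient

    # Hypothetical host and names; use the sync source from your own log.
    # directConnection=True keeps the driver on this one member instead of
    # discovering the replica set and routing to the primary.
    client = MongoClient(
        "mongodb://sync-source.example.internal:27017",
        directConnection=True,
    )
    valid, nrecords = validate_collection(client["mydb"], "mycollection")
    print(f"valid={valid} nrecords={nrecords}")
```

Keep in mind that validate can be expensive on large collections, so it is worth picking a quiet moment to run it.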