Production postmortem: The case of the hung over server


This was an interesting support call, about six months ago, if I recall correctly. It started with a customer calling because his database would sometimes stop responding for minutes (up to 30 minutes in some cases), and only when there was a high load on the system.

We started looking at this, and it was really strange. They could reliably reproduce it on their production system, but only there, and only under pretty high load. Simulating the same load on a non-production machine had no impact whatsoever; even tripling the load there made no difference.

We asked the customer to send us the debug logs, and we found an interesting tidbit of information in there. The issue always happened when the server was loading a database. The sequence of events went something like this:

  • Lots of users on the site, using the (multiple) databases on the server.
  • For some reason (lunch, meeting, whatever) there is a period of an hour or so when there is no activity on one part of the system, which meant that there are no requests for a particular database.
  • RavenDB would unload the database and release all resources associated with it.
  • After that period of time, and while the server is still in high load, requests would resume for that database. Usually in a big burst of requests for that database.
  • RavenDB would start loading the database.
  • All requests would start hanging.
  • About 15 – 25 minutes later, RavenDB would start the database.
  • All timing information in the logs indicated that there was no long operation.
  • Everything would return to normal.

To be honest, this looked a lot like someone had suspended the entire process somehow. CPU during this time was effectively zero, memory usage was fine, and there was no I/O.

The quick workaround was to increase the database idle time to account for the breaks in activity. With the database no longer loading and unloading under load, there was no issue.

Now it was time to try to figure out what the hell was going on. Going over the logs, we could see that there were some pretty big gaps in the initialization order of the database. To be exact, the initialization of the indexes took almost all of the time. But why? Looking at each individual index, it took very little time to initialize, and even all of them together didn't take a long time. But the time from starting the index initialization until the first index actually began to initialize was very long.

A few years ago we sped up the initialization times for large RavenDB databases by parallelizing the initialization of all indexes. That was the first clue. Something in this parallel initialization wasn’t working.
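
To give you an idea of the shape of that optimization, here is a minimal sketch (not RavenDB's actual code; the `IndexStartup` class and the `Action[]` initializers are made up for illustration): each index initialization is queued to the .NET thread pool, and the loading code waits for all of them to finish.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class IndexStartup
{
    // Queue each index initialization to the thread pool and wait for all of
    // them. Each individual initialization is cheap, so when the pool has
    // threads to spare this finishes quickly.
    public static void InitializeAll(Action[] indexInitializers)
    {
        Task[] tasks = indexInitializers
            .Select(init => Task.Run(init))
            .ToArray();

        Task.WaitAll(tasks); // the loading thread blocks here until every index is ready
    }
}
```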

And there was also the load factor to consider. The problem only surfaced under load, and to be more precise, under load when we had multiple databases involved, but the issue affected the entire server.

When RavenDB gets a request for an unloaded database, it starts loading it in a background thread, and the request waits until the database is loaded.
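
In rough pseudo-C# (hypothetical names, nothing like the real code base), that request path looks something like this: the first request for an unloaded database kicks off the load on a background task, and every request for that database then blocks until the load completes.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public class DatabaseLoader
{
    private readonly ConcurrentDictionary<string, Lazy<Task<object>>> _databases =
        new ConcurrentDictionary<string, Lazy<Task<object>>>();

    public object GetDatabase(string name)
    {
        // The first request triggers the load; everyone gets the same task back.
        var lazyLoad = _databases.GetOrAdd(name, key =>
            new Lazy<Task<object>>(() => Task.Run(() => LoadDatabase(key))));

        // The request's thread-pool thread blocks right here until the
        // database has finished loading.
        return lazyLoad.Value.Result;
    }

    private object LoadDatabase(string name)
    {
        // open storage, recover transactions, initialize all the indexes
        // (in parallel, on the thread pool), etc.
        return new object();
    }
}
```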

Combine all of those together, and you might start seeing the answer.

RavenDB has an unloaded database, and we suddenly get a high number of requests for that database, all at the same time. RavenDB will start a background task loading the database, and wait until it is loaded to complete the request. However, because we have a lot of requests to this database, we have a lot of threads that are now waiting for it to complete.

Meanwhile, the database is loading in a background thread, and needs to initialize its indexes. It does this in parallel, using the thread pool…

The same thread pool that we serve requests from…

The same thread pool where we have a lot of threads that are currently waiting for the database to load…

The same database that is now waiting for tasks to complete on that busy thread pool…

If you had things going just right, you would get into a situation where the database couldn't load, because all the thread pool threads it needed in order to start were busy waiting for that very database to load.

Now, the thread pool would slowly increase the number of threads to take care of the amount of work it had. But those threads would usually be kept busy with more requests to the loading database.

Eventually, the index initialization would go through, the database would complete loading, all the database requests would complete, and the system would go back to normal. But in the meantime… ouch!
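
If you want to see this with your own eyes, here is a small self-contained console repro of the failure mode (a simulation, not RavenDB code): a burst of "requests" grabs the thread pool and blocks waiting for a "database load", which in turn needs that same thread pool for its parallel "index initialization".

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class ThreadPoolStarvationDemo
{
    static void Main()
    {
        // Keep the pool small so the effect shows up in seconds instead of minutes.
        ThreadPool.SetMinThreads(4, 4);

        var sw = Stopwatch.StartNew();

        // The "database load": runs on its own thread, then fans the index
        // initialization out to the thread pool and waits for all of it.
        var databaseLoad = new Task(() =>
        {
            Console.WriteLine($"{sw.Elapsed.TotalSeconds:0.0}s: load started, initializing indexes");
            var indexInits = new Task[8];
            for (var i = 0; i < indexInits.Length; i++)
                indexInits[i] = Task.Run(() => Thread.Sleep(100)); // a single index is cheap
            Task.WaitAll(indexInits);
            Console.WriteLine($"{sw.Elapsed.TotalSeconds:0.0}s: load finished");
        }, TaskCreationOptions.LongRunning);

        // A burst of "requests" for the not-yet-loaded database: each one
        // grabs a thread-pool thread and blocks until the load completes.
        for (var i = 0; i < 16; i++)
            ThreadPool.QueueUserWorkItem(_ => databaseLoad.Wait());

        Thread.Sleep(250);    // let the burst occupy the pool first, as in the incident
        databaseLoad.Start(); // now try to load the database

        databaseLoad.Wait();
        Console.WriteLine($"{sw.Elapsed.TotalSeconds:0.0}s: everything is back to normal");
    }
}
```

The simulated work is only a few hundred milliseconds, but the "load finished" line shows up much later, because the starved pool only injects new threads gradually and each new thread usually just picks up another blocked request; how long the stall lasts depends on the runtime version and its thread-injection heuristics. Raising the minimum with ThreadPool.SetMinThreads shortens it, and the stall grows with the number of blocked requests already in the queue, which is why a lightly loaded test machine never showed the problem.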

The temporary workaround, as I mentioned, was increasing the database idle time to prevent this situation from happening. The long term solution was to limit the amount of time that a thread would wait for a database to load, as well as limit the number of threads that could wait for a database to load. The first five requests for an unloaded database would wait up to ten seconds for that database to load; any additional requests would immediately return with a service unavailable code.

We also made sure that waiting for a database to load can be done asynchronously, and in 3.5 we have a dedicated thread pool for internal RavenDB tasks. But I'll have another post about that.
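
To make the shape of that fix concrete, here is a minimal sketch of the idea (made-up names, not the actual RavenDB implementation): a small gate that lets at most five requests wait, asynchronously and for at most ten seconds, for a database that is still loading, while everyone else gets an immediate 503.

```csharp
using System;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

public class DatabaseLoadGate
{
    // At most five requests may park themselves on a loading database.
    private readonly SemaphoreSlim _waiterSlots = new SemaphoreSlim(5, 5);

    public async Task<HttpStatusCode> WaitForLoadAsync(Task databaseLoadTask)
    {
        if (databaseLoadTask.IsCompleted)
            return HttpStatusCode.OK;

        // Too many requests are already waiting for this database? Fail fast.
        if (!await _waiterSlots.WaitAsync(TimeSpan.Zero))
            return HttpStatusCode.ServiceUnavailable;

        try
        {
            // Wait without blocking a thread-pool thread, and never for more
            // than ten seconds (error handling elided for brevity).
            var winner = await Task.WhenAny(databaseLoadTask, Task.Delay(TimeSpan.FromSeconds(10)));
            return winner == databaseLoadTask
                ? HttpStatusCode.OK
                : HttpStatusCode.ServiceUnavailable;
        }
        finally
        {
            _waiterSlots.Release();
        }
    }
}
```

The important part is that the wait is await-based, so a request waiting for the load doesn't hold a thread-pool thread hostage, and the gate puts a hard bound on how much of the pool a single loading database can tie up.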
