One of the most devastating and unpredictable latency intruders is the Java Virtual Machine’s “stop the world” pause for garbage collection (memory clean-up). This post troubleshoots an unexpected RegionServer shutdown whose root cause was a Full GC, and makes the whole chain of events clear.

Troubleshooting

You Are Dead

A RegionServer stops, and the logs say:

The RegionServer gets a YouAreDeadException when it reports to HMaster to confirm it is still alive. HMaster has already marked it as dead, so the RegionServer has no choice but to kill itself.

Why does HMaster deem a RegionServer that is still running to be dead?

Lease Expired

Scrolling up in the logs, it is easy to find:

The lease with the HDFS NameNode has expired, so the immediate problem is of course a timeout. But what is the root cause?
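As a rough intuition for why a long pause turns into a lease timeout, here is a minimal, purely illustrative sketch (the class and the numbers are hypothetical, not the HDFS client code): a lease stays valid only while a background thread keeps renewing it, and a stop-the-world pause freezes that thread together with everything else, so from the other side's point of view the renewals simply stop arriving.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration: a lease is only as alive as its renewer thread. */
public class LeaseDemo {
    static final long EXPIRY_MS = 60_000;  // stand-in for the lease expiry limit
    static final AtomicLong lastRenewal = new AtomicLong(System.currentTimeMillis());

    public static void main(String[] args) throws InterruptedException {
        // Background renewer, comparable in spirit to the HDFS client's lease renewer.
        Thread renewer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {            // renew once a second...
                    lastRenewal.set(System.currentTimeMillis());
                    Thread.sleep(1_000);
                }
                Thread.sleep(79_000);                      // ...then simulate a 79s freeze
            } catch (InterruptedException ignored) { }
        });
        renewer.setDaemon(true);
        renewer.start();

        // The remote side (standing in for the NameNode) keeps its own clock, so it
        // is not frozen by our pause and eventually declares the lease expired.
        while (true) {
            long silence = System.currentTimeMillis() - lastRenewal.get();
            if (silence > EXPIRY_MS) {
                System.out.println("Lease expired: no renewal for " + silence + " ms");
                return;
            }
            Thread.sleep(5_000);
        }
    }
}
```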

ZooKeeper Client Session Timeout

A few pages further up, the shutdown sequence actually begins with:

HRegionServer slept for 79s because of a long garbage collection pause. Finally, the culprit appears.
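The 79-second figure itself comes from a simple watchdog trick: a thread asks to sleep for a short, fixed interval, then compares how long it actually slept against what it requested; a huge gap can only mean the whole JVM was frozen in between. Below is a minimal sketch of that idea (hypothetical class name, not HBase's actual sleeper code):

```java
/** Hypothetical sketch: detect stop-the-world pauses by measuring oversleep. */
public class PauseDetector {
    public static void main(String[] args) throws InterruptedException {
        final long intervalMs = 500;           // how long we intend to sleep each round
        final long warnThresholdMs = 10_000;   // complain if we overslept by this much

        while (true) {
            long start = System.currentTimeMillis();
            Thread.sleep(intervalMs);
            long slept = System.currentTimeMillis() - start;

            // If a stop-the-world GC froze the JVM, "slept" is far larger than
            // "intervalMs" even though we only asked for a short nap.
            if (slept - intervalMs > warnThresholdMs) {
                System.out.println("Slept " + slept + "ms instead of " + intervalMs
                        + "ms; this is likely due to a long garbage collection pause");
            }
        }
    }
}
```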

Explanation

Let’s walk through what happens after a Full GC hits a RegionServer.

  1. The heartbeat between the RegionServer and ZooKeeper stops while the Full GC pauses the RegionServer.
  2. The ZooKeeper session times out, and ZooKeeper deletes the ephemeral node corresponding to the unresponsive RegionServer.
  3. ServerManager on HMaster notices that something is wrong with this RegionServer and hands it over to ServerShutdownHandler (the first sketch after this list shows how a watcher can notice the disappeared node).
  4. SplitLogManager on HMaster and its workers, the SplitLogWorkers on the RegionServers, start log splitting. For details, see Apache HBase Log Splitting.
  5. AssignmentManager on HMaster assigns the regions of the moribund RegionServer to other RegionServers. The logs are replayed by reading the edits and saving them into the MemStore. After all edit files are replayed, the contents of the MemStore are flushed to disk (HFiles) and the edit files are deleted.
  6. The reopened regions become available again.
  7. When the Full GC finally ends on the moribund RegionServer, it tries to report to HMaster via tryRegionServerReport() and receives a YouAreDeadException (the second sketch after this list illustrates this report-and-abort pattern).
  8. The doomed RegionServer closes all its threads and dies.
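Steps 1–3 hinge on ZooKeeper’s ephemeral nodes: every live RegionServer registers itself as an ephemeral znode (by default under /hbase/rs), and that node survives only as long as the client session keeps heartbeating, which a long Full GC prevents. The following is a minimal, hypothetical master-side sketch written against the plain ZooKeeper client API; HMaster’s real tracking code is considerably more involved than this.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

/** Hypothetical sketch: notice when a RegionServer's ephemeral znode disappears. */
public class DeadServerWatcher {
    private static final String RS_ZNODE = "/hbase/rs";  // default HBase rs znode

    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        Set<String> known = new HashSet<>(zk.getChildren(RS_ZNODE, false));
        while (true) {
            CountDownLatch changed = new CountDownLatch(1);
            // Re-register a one-shot children watch; it fires on membership changes.
            List<String> current = zk.getChildren(RS_ZNODE, event -> changed.countDown());

            // Any server we knew about that is no longer listed has lost its
            // ZooKeeper session (for example through a long Full GC).
            for (String server : known) {
                if (!current.contains(server)) {
                    System.out.println("RegionServer " + server
                            + " lost its znode; hand it to server shutdown handling");
                }
            }
            known = new HashSet<>(current);
            changed.await();  // block until the next membership change
        }
    }
}
```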
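Step 7 closes the loop with the “You Are Dead” log from the first section: by the time the paused server reports in again, the master has already processed it as a dead server, so the report is rejected, and the only safe reaction is to stop serving, because its regions are now owned by other servers. Here is a simplified, hypothetical sketch of that report-and-abort pattern (the real call in HRegionServer is tryRegionServerReport() and the real exception is org.apache.hadoop.hbase.YouAreDeadException; everything below is stand-in code):

```java
/** Hypothetical sketch of the report-and-abort pattern after a long pause. */
public class ReportLoop {

    /** Stand-in for the real YouAreDeadException. */
    static class YouAreDeadException extends Exception {
        YouAreDeadException(String msg) { super(msg); }
    }

    /** Stand-in for the RPC through which a RegionServer reports to the master. */
    interface Master {
        void regionServerReport(String serverName) throws YouAreDeadException;
    }

    public static void main(String[] args) {
        Master master = serverName -> {
            // The master already ran ServerShutdownHandler for this server,
            // so a late report can only be rejected.
            throw new YouAreDeadException("report rejected, already expired: " + serverName);
        };

        try {
            // First report after the 79s pause finally ends.
            master.regionServerReport("regionserver-1,16020,1700000000000");
        } catch (YouAreDeadException e) {
            // The regions were already reassigned elsewhere; continuing to serve them
            // would risk inconsistencies, so the server stops its threads and exits.
            System.out.println("Aborting: " + e.getMessage());
        }
    }
}
```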