Troubleshooting 6-Min Latency Between Message Creation and Retrieval

Symptom

The interval between message creation and retrieval occasionally reaches 6 minutes, which the services cannot tolerate.

Possible Causes

  1. Messages are stacked and cannot be processed in time.

    According to the monitoring data, at most 50 messages are stacked and at most 10 messages are created per second. Both figures are within the processing capability, so this is not the cause of the symptom.

  2. The EIP inbound traffic decreases.

    Ask the EIP technical support personnel to check. If they find no problem, this is not the cause of the symptom.

  3. The consumer group is behaving abnormally.

    According to the server logs, the consumer group rebalances frequently. Most rebalances complete within seconds, but some take several minutes. Messages cannot be retrieved until a rebalance completes.

    This is the cause of the symptom.
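
As a sanity check on cause 1 above, the backlog figures can be turned into a rough delay estimate. The sketch below assumes that consumers drain messages at least as fast as the peak creation rate (10 messages per second). That consumption rate is an assumption made for illustration, not a measured value, and the function name is hypothetical.

```python
def backlog_delay_seconds(backlog_messages: int, consume_rate_per_s: float) -> float:
    """Approximate time a new message waits behind the existing backlog."""
    if consume_rate_per_s <= 0:
        raise ValueError("consume rate must be positive")
    return backlog_messages / consume_rate_per_s


# From the monitoring data: at most 50 stacked messages.
# Assumption: consumption keeps pace with the peak creation rate of 10 msg/s.
delay = backlog_delay_seconds(50, 10)
observed_latency_s = 6 * 60

print(delay, observed_latency_s)  # 5.0 360
```

Under that assumption the backlog accounts for a few seconds of delay at most, far short of the 6 minutes observed, which is consistent with ruling out cause 1.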

Detailed Analysis

A consumer group may exhibit the following three types of behavior in the log:

The following figure shows the duration between Preparing and Stabilized. The time shown in the figure is UTC+0.

Figure 1 Consumer group rebalance

The data shows that the rebalance performance of the consumer group deteriorated after 06:49 on July 1, which caused the client to behave abnormally.
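
The duration between Preparing and Stabilized can be computed once the relevant events have been extracted from the server logs. The following is a minimal sketch of that pairing. The event tuples, the sample timestamps (including the year), and the 60-second threshold are all hypothetical and are not taken from the actual logs.

```python
from datetime import datetime, timedelta


def rebalance_durations(events):
    """Pair each 'Preparing' event with the next 'Stabilized' event.

    events: iterable of (timestamp, state) tuples sorted by time.
    Returns a list of (start_time, duration) tuples.
    """
    durations = []
    started = None
    for ts, state in events:
        if state == "Preparing" and started is None:
            started = ts  # a rebalance begins
        elif state == "Stabilized" and started is not None:
            durations.append((started, ts - started))
            started = None  # the group is stable again
    return durations


# Hypothetical sample events in UTC; the year is arbitrary.
sample = [
    (datetime(2024, 7, 1, 6, 40, 0), "Preparing"),
    (datetime(2024, 7, 1, 6, 40, 3), "Stabilized"),
    (datetime(2024, 7, 1, 6, 50, 0), "Preparing"),
    (datetime(2024, 7, 1, 6, 55, 30), "Stabilized"),
]

result = rebalance_durations(sample)
# Flag rebalances longer than an illustrative 60-second threshold.
slow = [(start, d) for start, d in result if d > timedelta(seconds=60)]
for start, d in slow:
    print(start.isoformat(), d.total_seconds())
```

Listing only the slow rebalances makes it easier to see when the deterioration starts.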

Root Cause

Sometimes, a consumer cannot respond to rebalancing in a timely manner. As a result, the entire consumer group is blocked until the consumer responds.

Workaround

  1. Use different consumer groups for different services to reduce the impact of a single consumer blocking access.
  2. Increase the value of max.poll.interval.ms. This parameter sets the maximum interval between two consecutive poll requests from a consumer. If a consumer does not poll again before the interval elapses, the server triggers rebalancing.
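
The two workarounds above can be sketched as a small configuration helper. This is an illustration under stated assumptions: the function name, the "<service>-group" naming scheme, the broker address, and the 600000 ms value are hypothetical. The dotted keys are the standard Kafka property names; whether a client accepts them in exactly this form depends on the library in use.

```python
# Default of max.poll.interval.ms in Apache Kafka clients: 5 minutes.
DEFAULT_MAX_POLL_INTERVAL_MS = 300_000


def build_consumer_config(service_name, bootstrap_servers,
                          max_poll_interval_ms=600_000):
    """Build consumer properties with one consumer group per service."""
    if max_poll_interval_ms <= 0:
        raise ValueError("max_poll_interval_ms must be positive")
    return {
        "bootstrap.servers": bootstrap_servers,
        # Hypothetical naming scheme: a dedicated group for each service.
        "group.id": f"{service_name}-group",
        # Raised above the default so slow consumers have more time to poll.
        "max.poll.interval.ms": max_poll_interval_ms,
    }


# Placeholder broker address and service names.
orders_cfg = build_consumer_config("orders", "broker.example:9092")
billing_cfg = build_consumer_config("billing", "broker.example:9092")
print(orders_cfg["group.id"], billing_cfg["group.id"])
```

Because each service gets its own group, a blocked consumer in one service does not hold up rebalancing for the other.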

Solution

  1. Use different consumer groups for different services.
  2. Optimize the service processing logic to improve the processing efficiency and reduce the blocking time.
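
One way to reason about the blocking time in step 2 is to estimate how many records a consumer can safely handle between two polls. The sketch below is a back-of-the-envelope calculation, not a tuning recommendation: the function name, the 200 ms per-record cost, and the safety factor of 0.5 are assumptions chosen for illustration.

```python
import math


def max_records_per_poll(max_poll_interval_ms, per_record_ms, safety_factor=0.5):
    """Estimate how many records fit between two polls with headroom to spare."""
    if per_record_ms <= 0:
        raise ValueError("per_record_ms must be positive")
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    budget_ms = max_poll_interval_ms * safety_factor
    # Always allow at least one record so the consumer can make progress.
    return max(1, math.floor(budget_ms / per_record_ms))


# Assumed figures: 5-minute poll interval, 200 ms to process one record.
batch_limit = max_records_per_poll(300_000, 200)
print(batch_limit)  # 750
```

If the estimate is lower than the batch size the consumer currently fetches, either the per-record processing needs to be faster or the batch needs to be smaller. In the Java client, for example, max.poll.records limits the number of records returned by a single poll.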

Background Knowledge

A consumer group is in one of two states: REBALANCING or STABILIZED.

A consumer group works as follows:

  1. A consumer leaves or joins the group, changing the consumer group metadata recorded at the server. The server updates the consumer group status to REBALANCING.
  2. The server waits for all consumers (including existing ones) to synchronize the latest metadata.
  3. After all consumers have synchronized the latest metadata, the server updates the consumer group status to STABILIZED.
  4. Consumers retrieve messages.
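
The four steps above can be modeled as a toy state machine, which also shows why a single slow consumer blocks the whole group. This is a simplified sketch for illustration only. The class and method names are hypothetical and do not reflect how the server is actually implemented.

```python
class ConsumerGroup:
    """Toy model of a consumer group with two states."""

    def __init__(self, members):
        self.members = set(members)
        self.pending = set()  # members that have not yet synchronized
        self.state = "STABILIZED"

    def _start_rebalance(self):
        # Step 1: a membership change moves the group to REBALANCING.
        self.pending = set(self.members)
        self.state = "REBALANCING" if self.pending else "STABILIZED"

    def join(self, member):
        self.members.add(member)
        self._start_rebalance()

    def leave(self, member):
        self.members.discard(member)
        self._start_rebalance()

    def sync(self, member):
        # Steps 2 and 3: the group stabilizes only after every member syncs.
        self.pending.discard(member)
        if self.state == "REBALANCING" and not self.pending:
            self.state = "STABILIZED"

    def can_retrieve(self):
        # Step 4: messages are retrievable only in the STABILIZED state.
        return self.state == "STABILIZED"


group = ConsumerGroup(["c1", "c2"])
group.join("c3")   # triggers a rebalance
group.sync("c1")
group.sync("c3")
blocked_state = (group.state, group.can_retrieve())
group.sync("c2")   # the slow consumer finally responds
final_state = (group.state, group.can_retrieve())
print(blocked_state, final_state)
```

Until c2 synchronizes, the group stays in REBALANCING and no member can retrieve messages, even though c1 and c3 responded promptly.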