For this implementation, I use the official redis npm package (formerly node-redis). With other libraries like ioredis, you might run into different problems.
Add this simple config to the Node.js redis package to make it PING:
...
disableOfflineQueue: true,
pingInterval: 1000,
The PING keeps a heartbeat going, so the connection is less likely to be idled out and terminated.
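As a minimal sketch of where those options live (assuming node-redis v4+; host, port, and credentials are placeholders):

const { createClient } = require('redis');

const client = createClient({
  url: 'redis://my-cache.example.com:6379', // placeholder host and port
  password: process.env.REDIS_PASSWORD,     // placeholder credential
  disableOfflineQueue: true, // fail fast instead of queueing commands while disconnected
  pingInterval: 1000,        // send a PING every second to keep the connection warm
});

await client.connect();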
This approach is confirmed by the Azure docs: the default Linux configuration cannot recover quickly after a TCP retransmission failure.
The other half of the story falls on Azure and Microsoft.
Don’t you love the “there is no problem whatsoever, and we never ever touch them” attitude?
When there is a TCP transmission problem, by default TCP WAITs and then tries to resend the packet.
The keyword here is “WAIT”.
That is why we got the long hangup.
A similar issue also happens with Elasticsearch.
This is “exponential backoff”, or in simple terms: “wait before resending, and if that fails, double the wait time”. As an example, an initial wait time of 200 ms over 5 retransmissions adds up to around 6 s (200 + 400 + 800 + 1600 + 3200 = 6200 ms).
But on Linux that retry count is stored in a sysctl, and it is 15, meaning around 15 retransmissions. With the wait doubling each time, this can take 10 minutes or more, depending on when in the event you start observing.
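To make the arithmetic concrete, here is a rough model in JavaScript. It is illustrative only: it ignores RTT estimation and simply doubles a base delay, optionally capping each wait the way the Linux kernel caps the retransmission timeout at 120 s:

// Rough model of doubling retransmission backoff.
function totalBackoffMs(initialMs, retries, capMs = Infinity) {
  let total = 0;
  let wait = initialMs;
  for (let i = 0; i < retries; i += 1) {
    total += Math.min(wait, capMs); // each retry waits, up to the cap
    wait *= 2;                      // then the wait doubles
  }
  return total;
}

console.log(totalBackoffMs(200, 5));          // 6200 ms: the "around 6 s" example
console.log(totalBackoffMs(200, 15, 120000)); // 804600 ms: over 13 minutes of hanging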
$ sysctl -a | grep tcp_retries
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
On the server, this root cause can be confirmed with:
$ netstat -ant
... Send-Q ... Port ...
6378
// You will see large numbers in the Send-Q column,
// confirming that packets are stuck and piling up there,
// waiting out the retransmission backoff.
Or, on my local machine, I can confirm the result with a Wireshark capture: a lot of TCP Retransmission entries, until the connection finally closes and the three-way handshake (SYN, SYN-ACK, ACK) starts all over again.
My machine runs macOS, which is better configured for this case than Linux.
This is not a problem of application inactivity.
Your Redis PING will get stuck and pile up in the Send-Q, waiting too.
The socket.timeout implemented in redis is a timeout for the “connect” phase only, not a timeout for when the client is ready and sending commands.
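To sketch that distinction (assuming node-redis v4, where this option is spelled socket.connectTimeout; the URL is a placeholder):

const { createClient } = require('redis');

const client = createClient({
  url: 'redis://my-cache.example.com:6379', // placeholder
  socket: {
    connectTimeout: 5000, // bounds only the TCP connect phase
  },
});

await client.connect();

// Once connected, nothing above bounds an individual command:
// a GET on a dead-but-not-yet-closed socket can still hang until
// the kernel exhausts its retransmissions.
const value = await client.get('some-key');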
With the above problem, we can just set sysctl -w net.ipv4.tcp_retries2=<number>, using the key we saw earlier. If your deployment allows you to set it, 3, 5, or 8 is a reasonable value for a modern app; RFC 1122 only requires the limit to correspond to at least 100 seconds.
But if that is not the case, you have two problems:
- You cannot apply sysctl in the Dockerfile.
- You cannot apply sysctl in a ready-made service.
You will find SOME HOPE in YAML whose structure looks like Kubernetes. Sadly, there is no securityContext where you can set allowPrivilegeEscalation: true like in AKS, and no sysctls YAML property where you can set it directly.
The solution is to implement a “command timeout”.
When sending PING, GET, or SET, we need a timeout there too.
Here is a trick: instead of implementing a timeout on every single command, you can rely on one fact:
If the socket is stuck, it is stuck everywhere.
So instead, we implement one central interval that checks the connection and reconnects.
const { createClient } = require('redis');

const config = {
  url: `redis://${redisConfig.host}:${redisConfig.port}`,
  password: redisConfig.password,
  disableOfflineQueue: true,
  pingInterval: 1000,
};

let redis = createClient(config);
await redis.connect();

// This is NOT the socket timeout option,
// which only applies while connecting.
// Connection timeout vs. command timeout.
setInterval(async () => {
  let timer;
  try {
    const heartbeatKey = `heartbeat`;
    // The real command we want to guard with a timeout
    const read = redis.get(heartbeatKey);
    read.catch(() => {}); // swallow a late rejection if the timer wins
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => {
        reject(new Error('Interval Check Command Timeout YES'));
      }, TIMEOUT_TIME);
    });
    // Let's race our promises
    await Promise.race([read, timeout]);
    await redis.set(heartbeatKey, Date.now());
  } catch (error) {
    // The command hung or failed: drop the stuck socket and reconnect
    await redis.disconnect();
    redis = await connectionRedis(config); // reconnect helper, sketched below
    await redis.set(`incident-${Date.now()}`, `${new Date()}`);
  } finally {
    clearTimeout(timer); // avoid an unhandled rejection after a won race
  }
}, INTERVAL_TIME);
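The connectionRedis helper is not shown above and is not part of the library; here is a hypothetical minimal version, assuming it just builds a fresh client from the same config so the stuck socket and its piled-up Send-Q are abandoned entirely:

// Hypothetical sketch of the reconnect helper used above.
async function connectionRedis(config) {
  const client = createClient(config);
  client.on('error', (err) => console.error('Redis error', err)); // don't crash on socket errors
  await client.connect();
  return client;
}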
It took me a few days (and nights) to figure this out.
Hope this saves you some time.