Experiencing an issue with one of the Redis master-slave nodes (Replication master link down on host:)

Redis version: 3.2.10

Hello, I am facing an issue with a Redis cluster: there is a difference in the keys replicated from the master to the slave node, and the slave also frequently disconnects from the master during sync.

There are no issues with the other nodes: their sync is fine and their replicas stay connected to their masters all the time. With 106 (slave), however, which syncs from the master (101), the replica only comes up for a short time while the sync happens and then drops the connection.
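
One way to quantify the lag is to compare the replication offsets that INFO replication reports on both nodes, roughly like the redis-py sketch below (the host addresses and port are placeholders, not the real ones):

```python
import redis

# Placeholder endpoints for the 101 master and the 106 slave; adjust to the real hosts.
master = redis.Redis(host="10.0.0.101", port=7005)
slave = redis.Redis(host="10.0.0.106", port=7005)

m = master.info("replication")
s = slave.info("replication")

# On a healthy pair, master_link_status is "up" and the two offsets stay close together.
print("slave link status :", s.get("master_link_status"))
print("master repl offset:", m.get("master_repl_offset"))
print("slave repl offset :", s.get("slave_repl_offset"))
if s.get("slave_repl_offset") is not None:
    print("offset lag        :", m["master_repl_offset"] - s["slave_repl_offset"])
```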

Master logs

1926:M 19 May 19:26:02.066 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
39275:C 19 May 19:26:13.630 * DB saved on disk
39275:C 19 May 19:26:14.179 * RDB: 338 MB of memory used by copy-on-write
1926:M 19 May 19:26:15.266 * Background saving terminated with success
1926:M 19 May 19:27:23.898 * Synchronization with slave IP:7005 succeeded
1926:M 19 May 19:27:30.741 * FAIL message received from 3f77505ff8ecc1e8f8c0ca830ab911a2554f7408 about bd2b69c17c616a00fb5d6c1e6bf43740c4ae4abb
1926:M 19 May 19:31:13.367 # Disconnecting timedout slave: IP:7005
1926:M 19 May 19:31:13.367 # Connection with slave IP:7005 lost.
1926:M 19 May 19:33:17.348 * Clear FAIL state for node bd2b69c17c616a00fb5d6c1e6bf43740c4ae4abb: slave is reachable again.

Slave logs

9389:S 19 May 17:28:06.549 * Caching the disconnected master state.
9389:S 19 May 17:28:07.551 * Connecting to MASTER IP:7005
9389:S 19 May 17:28:07.551 * MASTER <-> SLAVE sync started
9389:S 19 May 17:28:07.552 * Non blocking connect for SYNC fired the event.
9389:S 19 May 17:28:07.552 * Master replied to PING, replication can continue...
9389:S 19 May 17:28:07.553 * Trying a partial resynchronization (request 30f35d50a3044d0b8209b9ff496040ea3f2e372d:315799868857).
9389:S 19 May 17:28:08.231 * Full resync from master: 30f35d50a3044d0b8209b9ff496040ea3f2e372d:315827839234
9389:S 19 May 17:28:08.231 * Discarding previously cached master state.
9389:S 19 May 17:31:52.577 * MASTER <-> SLAVE sync: receiving 8124354516 bytes from master
9389:S 19 May 17:33:03.543 * MASTER <-> SLAVE sync: Flushing old data
14111:C 19 May 17:34:05.376 # Write error writing append only file on disk: Connection timed out
9389:S 19 May 17:38:53.933 * MASTER <-> SLAVE sync: Loading DB in memory
9389:S 19 May 17:38:54.249 * AOF rewrite child asks to stop sending diffs.
9389:S 19 May 17:56:50.564 * MASTER <-> SLAVE sync: Finished with success
9389:S 19 May 17:56:50.564 * Killing running AOF rewrite child: 14111
9389:S 19 May 17:56:51.014 * Background append only file rewriting started by pid 27665
9389:S 19 May 17:56:51.432 # Connection with master lost.
9389:S 19 May 17:56:51.432 * Caching the disconnected master state.
9389:S 19 May 17:56:52.435 * Connecting to MASTER IP:7005
9389:S 19 May 17:56:52.435 * MASTER <-> SLAVE sync started
9389:S 19 May 17:56:52.435 * Non blocking connect for SYNC fired the event.
9389:S 19 May 17:56:52.436 * Master replied to PING, replication can continue...
9389:S 19 May 17:56:52.436 * Trying a partial resynchronization (request 30f35d50a3044d0b8209b9ff496040ea3f2e372d:315828723971).
9389:S 19 May 17:56:53.086 * Full resync from master: 30f35d50a3044d0b8209b9ff496040ea3f2e372d:315856923606
9389:S 19 May 17:56:53.086 * Discarding previously cached master state.
9389:S 19 May 18:00:33.495 * MASTER <-> SLAVE sync: receiving 8124500428 bytes from master
9389:S 19 May 18:01:45.284 * MASTER <-> SLAVE sync: Flushing old data
27665:C 19 May 18:02:46.251 # Write error writing append only file on disk: Connection timed out
9389:S 19 May 18:07:28.943 * MASTER <-> SLAVE sync: Loading DB in memory
9389:S 19 May 18:07:29.227 * AOF rewrite child asks to stop sending diffs.
  • 106 is a replica of 101, and there are 38.79T keys on the 101 master (Delay on replica by host).
  • The replica node disconnects frequently and then joins again.
  • The dataset is loaded into the slave's memory, but then the cycle starts over again with the failure message in the log (see the check sketched below).
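
Because the master is disconnecting the slave as timed out while each attempt falls back to a full resync of roughly 8 GB, it seems worth confirming that repl-timeout and the slave class of client-output-buffer-limit are large enough to survive such a transfer. The sketch below (redis-py, placeholder addresses) only reads the current values; raising the timeout at runtime with CONFIG SET is shown as a comment and would also need to be persisted in redis.conf.

```python
import redis

# Placeholder endpoints; point these at the real 101 master and 106 slave.
nodes = {
    "master": redis.Redis(host="10.0.0.101", port=7005),
    "slave": redis.Redis(host="10.0.0.106", port=7005),
}

for name, node in nodes.items():
    # repl-timeout (seconds) must comfortably exceed the time a full resync takes,
    # otherwise the transfer is cut off and the whole cycle starts again.
    print(name, node.config_get("repl-timeout"))
    # The "slave" class of this limit caps how much replication output the master
    # may buffer for a slave before dropping the connection.
    print(name, node.config_get("client-output-buffer-limit"))

# Example of raising the timeout at runtime (also persist it in redis.conf):
# nodes["master"].config_set("repl-timeout", 600)
```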

Hello @initvik

The logs seem to indicate there’s a problem with the node’s storage (the "disk is busy?" AOF fsync warning on the master and the AOF write errors on the slave) - have you looked into that?
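
For example, Redis reports the outcome of its last disk operations in INFO persistence; something along these lines (redis-py, placeholder address for the 106 node) would show whether the AOF writes and forks are failing or abnormally slow:

```python
import redis

# Placeholder address for the struggling 106 slave.
slave = redis.Redis(host="10.0.0.106", port=7005)
p = slave.info("persistence")

# An "err" status here, or a very large fork time, points at slow or failing storage.
for field in ("aof_last_write_status",
              "aof_last_bgrewrite_status",
              "rdb_last_bgsave_status",
              "latest_fork_usec"):
    print(field, "=", p.get(field))
```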

Also, note that v3.2 isn’t maintained anymore, so you should consider upgrading.