A few weeks ago, a colleague on my team raised a question about a very basic distributed remote caching problem: when an application is scaled out across multiple instances, managing how data is cached can become a significant challenge. Let me take the shortcut and jump directly into the problem. Consider a scenario where you have multiple instances doing the same job, and you need to cache the result of a function that takes about 30 seconds. This function's result needs to be cached to avoid redundant executions across different instances. The challenge is making sure that the function is executed only once and the result is shared across all instances via Redis.
A good strategy to solve the issue is to implement a distributed locking mechanism on top of Redis. With this approach, an instance checks and sets a lock (if necessary) before executing the function. If the lock is already set by another instance, that signals the function is currently being executed, so the current instance waits and then retrieves the cached result. The core idea behind this strategy is to use Redis not only as a caching tool but also as a mechanism to orchestrate the long-running function across instances.
When an instance of the application wants to execute the function, it first attempts to acquire a lock in Redis. This lock is identified by a unique key known to all instances. If the lock is successfully acquired, no other instance is currently executing the function, and the instance that acquired the lock proceeds to execute it.
While the function is running, other instances attempting to execute it will also try to acquire the lock. If the lock is already held, these instances either wait or immediately fetch the cached result if one is available; the decision depends on the specific needs of the application and how up-to-date the data must be. Once the function execution is complete, the instance stores the result in Redis and releases the lock. This signals to any waiting instances that they can now retrieve the latest result from the cache instead of re-executing the function. To handle situations where there may be updates or changes in parameters for the function, versioning can be implemented: if a newer version identifier arrives while a previous version is still running, the newer request can either wait or proceed, based on application logic and data-freshness requirements. This approach ensures the function is executed efficiently at any given time, minimizing redundant computations and maintaining consistency across different instances of the application.
It works really well in cloud environments, especially when you have multiple instances that need to coordinate complicated tasks or process data without interfering with one another.
Here’s an example using C# and Redis:
using StackExchange.Redis;
using System;
using System.Threading.Tasks;

public class CachedComputationService
{
    private readonly IDatabase _redisDatabase;
    private const string ComputationKey = "computation_result";
    private const string LockKey = "computation_lock";
    // A unique token per instance, so LockReleaseAsync only releases a lock we own.
    private readonly string _lockToken = Guid.NewGuid().ToString();

    public CachedComputationService(IDatabase redisDatabase)
    {
        _redisDatabase = redisDatabase;
    }

    public async Task<string> GetComputationResultAsync()
    {
        var cachedResult = await _redisDatabase.StringGetAsync(ComputationKey);
        if (!cachedResult.IsNullOrEmpty)
        {
            return cachedResult;
        }
        // The expiry guarantees the lock is freed even if this instance crashes.
        if (await _redisDatabase.LockTakeAsync(LockKey, _lockToken, TimeSpan.FromMinutes(2)))
        {
            try
            {
                // Double-check: the cache may have been filled while we waited for the lock.
                cachedResult = await _redisDatabase.StringGetAsync(ComputationKey);
                if (!cachedResult.IsNullOrEmpty)
                {
                    return cachedResult;
                }
                var result = await LongRunningCalculation();
                await _redisDatabase.StringSetAsync(ComputationKey, result, TimeSpan.FromHours(1));
                return result;
            }
            finally
            {
                await _redisDatabase.LockReleaseAsync(LockKey, _lockToken);
            }
        }
        // Wait and retry if the lock was not acquired
        await Task.Delay(1000);
        return await GetComputationResultAsync();
    }

    private async Task<string> LongRunningCalculation()
    {
        // Simulate a long-running computation
        await Task.Delay(TimeSpan.FromSeconds(30));
        return "Result of complex calculation";
    }
}
And this is the Python code for the solution:
import redis
import time

redis_client = redis.Redis(host='localhost', port=6379, db=0)

def long_running_function():
    # Simulate a long-running task
    time.sleep(30)
    return "Result of complex calculation"

def acquire_lock_with_timeout(lock_name, timeout=60):
    """Try to acquire a lock with a timeout to avoid deadlocks."""
    end = time.time() + timeout
    while time.time() < end:
        # NX ensures only one instance can set the key; EX guards
        # against a crashed holder keeping the lock forever.
        if redis_client.set(lock_name, "true", ex=60, nx=True):
            return True
        time.sleep(1)
    return False

def release_lock(lock_name):
    """Release the lock. Note this deletes the key unconditionally;
    a production version would verify ownership first."""
    redis_client.delete(lock_name)

def get_cached_result_or_run():
    computation_key = "computation_result"
    lock_key = "computation_lock"
    cached_result = redis_client.get(computation_key)
    if cached_result:
        return cached_result.decode()
    if acquire_lock_with_timeout(lock_key, 60):
        try:
            # Double-check: the cache may have been populated while
            # we were waiting for the lock.
            cached_result = redis_client.get(computation_key)
            if cached_result:
                return cached_result.decode()
            result = long_running_function()
            redis_client.setex(computation_key, 3600, result)
            return result
        finally:
            release_lock(lock_key)
    else:
        time.sleep(10)
        return get_cached_result_or_run()

result = get_cached_result_or_run()
print(f"Function result: {result}")
On the other hand, managing the scenario where the instance that took responsibility for updating the cache fails before completing the task is crucial for the reliability of the system. Here are several strategies to handle such a situation:
1. Lock Expiration and Renewal Mechanism:
- Implement a lock with an expiration time that slightly exceeds the expected duration of the update task. For instance, if the task is expected to take 30 seconds, set the lock to expire after 45 seconds to 1 minute.
- Design the system so that the instance holding the lock periodically renews the lock's expiry in Redis, signalling that it is still active. If the renewals stop (because the instance failed), the lock simply expires and is released automatically (see the first sketch after this list).
- Other instances should continuously monitor the lock status. Once the lock is released (either normally or through expiration), one of them can acquire the lock and start or resume the update process.
2. Health Checks and Automated Recovery:
- Implement health checks for the instance performing the update. These checks should monitor not just the instance’s basic health but also the progress of the update task.
- In case a health check fails, indicating that the instance is unresponsive or the update process has stalled, an automated recovery mechanism should be triggered. This mechanism can involve spinning up a new instance or signalling an existing standby instance to attempt acquiring the lock and continue the update process.
- This setup ensures minimal downtime and quick recovery from failures, maintaining the continuous availability of the update process.
3. Checkpoint-Based Update Process:
- Design the update function to operate in stages or checkpoints. After completing each stage, the function records a checkpoint in a persistent store (like a database or a file system).
- If the instance fails and another instance takes over, the new instance can resume the update process from the last recorded checkpoint instead of starting from scratch (see the second sketch below).
- This approach minimizes redundant computations and reduces the time needed to complete the update after a failure, making the overall process more efficient and resilient.
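Here is a minimal sketch of the renewal mechanism from step 1, again reusing redis_client from the Python example above. The function name run_with_renewed_lock and the timing parameters are illustrative assumptions: the lock is taken with a short expiry and a unique owner token, and a background thread keeps extending the expiry while the work runs, so a crashed holder loses the lock automatically.

import threading
import uuid

def run_with_renewed_lock(lock_name, work, expiry=60, renew_every=20):
    # Take the lock with a unique token so only the owner releases it.
    token = str(uuid.uuid4())
    if not redis_client.set(lock_name, token, ex=expiry, nx=True):
        return None  # another instance holds the lock
    stop = threading.Event()

    def heartbeat():
        # Extend the expiry periodically while the task is still running.
        # If this process crashes, the heartbeats stop and the lock expires.
        while not stop.wait(renew_every):
            if redis_client.get(lock_name) == token.encode():
                redis_client.expire(lock_name, expiry)

    renewer = threading.Thread(target=heartbeat, daemon=True)
    renewer.start()
    try:
        return work()
    finally:
        stop.set()
        renewer.join()
        # Release only if we still own the lock (compare the token first).
        if redis_client.get(lock_name) == token.encode():
            redis_client.delete(lock_name)

In a production system the compare-and-delete at the end would be performed atomically, for example with a Lua script; the plain get/delete here keeps the sketch short.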
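And a sketch of the checkpoint-based process from step 3, using a Redis hash as the persistent store. The three stage functions are hypothetical placeholders standing in for the real work, and the checkpoint key is an assumption for illustration:

def load_data(_):
    time.sleep(10)  # stand-in for the first expensive stage
    return "loaded"

def transform_data(previous):
    time.sleep(10)
    return previous + ",transformed"

def aggregate_data(previous):
    time.sleep(10)
    return previous + ",aggregated"

CHECKPOINT_KEY = "computation_checkpoint"

def run_with_checkpoints():
    # Each stage records its output in a Redis hash. If another instance
    # takes over after a failure, it skips the stages already recorded
    # and resumes from the last checkpoint instead of starting over.
    stages = [("load", load_data), ("transform", transform_data), ("aggregate", aggregate_data)]
    previous = None
    for name, stage in stages:
        saved = redis_client.hget(CHECKPOINT_KEY, name)
        if saved is not None:
            previous = saved.decode()  # stage already done; resume from here
            continue
        previous = stage(previous)
        redis_client.hset(CHECKPOINT_KEY, name, previous)
    redis_client.delete(CHECKPOINT_KEY)  # clear checkpoints once the run completes
    return previous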
Suleyman Cabir Ataman, PhD