6. Operational Knowledge for Redis

Consider the scenario where your Spring Boot application can no longer connect to the Redis server, whether due to a Redis process crash, a server reboot, or a network failure between your application and Redis.

Your application’s behavior changes instantly and dramatically depending on how you use Redis:

  • For Caching: Every request that would have been a cache hit now becomes a cache miss. Calls made through RedisTemplate or @Cacheable proxies time out or throw exceptions.
  • For Sessions: If you use Spring Session, no user sessions can be retrieved or created. From the user’s perspective, everyone is logged out, and no one can log in.
  • For Distributed Locks: Your application cannot acquire locks. Any process that relies on a lock to prevent race conditions will either fail or, worse, proceed without the lock, leading to data corruption.
  • For Pub/Sub: Publishers can no longer send messages, and subscribers are disconnected. Real-time notifications and decoupled tasks stop flowing.

The most dangerous consequence of a cache failure is the cascading failure it causes.

Imagine your application normally serves 95% of its read traffic from the Redis cache. When Redis goes down:

  1. That 95% of traffic, which Redis previously served in under a millisecond, is now re-routed to your primary database.
  2. Your database, which was only sized to handle 5% of that traffic, is suddenly hit with a 20x load increase.
  3. This massive, unexpected load can easily overwhelm the database’s connection pool, CPU, and I/O, causing it to slow down to a crawl or crash entirely.

Result: A failure in your caching layer has now taken down your entire system.

Diagram: Cascading Failure

NORMAL OPERATION:

+-----------------+   95%    +---------+
| Spring Boot App | -------> |  Redis  |
+-----------------+          +---------+
        | 5%
        v
+-----------------+
|    Database     | (Lightly loaded)
+-----------------+

REDIS CRASHES:

+-----------------+  100% (All requests miss cache)
| Spring Boot App | -------> [ Redis is DOWN ]
+-----------------+
        | 100%
        v
+-----------------+
|    Database     | (OVERWHELMED & CRASHES)
+-----------------+

How to Handle It: Building Resilient Applications

Your application code must be prepared for Redis to be unavailable.

  1. Timeouts: Configure aggressive connection and read timeouts for your Redis client (e.g., in application.properties). It’s better to fail fast than to have threads hanging for 30 seconds waiting for Redis. (See the configuration sketch after this list.)
  2. Circuit Breaker Pattern: This is the most important solution. Wrap your Redis calls in a circuit breaker (using a library like Resilience4j), as sketched after this list.
    • If Redis calls start failing, the circuit “opens,” and subsequent calls fail instantly without even trying to connect to Redis.
    • This protects your application from being bogged down by failing network calls and, more importantly, protects your database from the stampede.
    • You can configure a “fallback” method to serve a default value or a less personalized response when the cache is unavailable.
  3. Graceful Degradation: The application should continue to function, perhaps in a degraded state. For an e-commerce site, this might mean the “personalized recommendations” module (which relies on Redis) is hidden, but users can still search for and buy products.
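
A hedged sketch of both ideas follows, assuming Spring Boot 3 property names, the Lettuce client, and the Resilience4j Spring Boot starter on the classpath. The timeout values are illustrative, and ProductRepository with its findJsonById method is a hypothetical stand-in for your own data access.

# application.properties -- fail fast instead of letting threads hang on a dead Redis
spring.data.redis.connect-timeout=250ms
spring.data.redis.timeout=250ms

And the circuit breaker around the cache lookup:

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class ProductCacheService {

    private final StringRedisTemplate redis;
    private final ProductRepository repository;  // hypothetical repository that returns JSON strings

    public ProductCacheService(StringRedisTemplate redis, ProductRepository repository) {
        this.redis = redis;
        this.repository = repository;
    }

    // If Redis calls keep failing, the "redisCache" breaker opens and subsequent
    // calls go straight to the fallback without waiting on a dead connection.
    @CircuitBreaker(name = "redisCache", fallbackMethod = "loadFromDatabase")
    public String getProductJson(String id) {
        String cached = redis.opsForValue().get("product:" + id);
        return (cached != null) ? cached : repository.findJsonById(id);
    }

    // Fallback: a degraded but working path that skips Redis entirely.
    private String loadFromDatabase(String id, Throwable redisFailure) {
        return repository.findJsonById(id);
    }
}

With the breaker open, requests skip the doomed network call instead of each waiting out a timeout, which keeps application threads free while Redis recovers.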

Beyond outright outages, there is a set of classic system design problems that occur whenever you put a cache in front of a database.

1. Cache Penetration (Missing Keys Problem)

  • The “What”: A malicious user (or a bug) repeatedly requests data for a key that exists neither in the cache nor in the database.
  • The Impact: Since the key is never in the cache (because it’s not in the database), every single one of these requests bypasses the cache and hits your database directly. This can be used to launch a targeted Denial of Service (DoS) attack on your database.
  • The Solutions:
    1. Cache Null Values: When your service looks up an ID in the database and finds nothing, don’t just return null. Instead, store a special “null” placeholder object in the cache for that key with a very short TTL (e.g., 1-5 minutes). The next request for the same non-existent key will get a cache hit on this “null” object, protecting the database. (A minimal sketch follows this list.)
    2. Bloom Filter: A more advanced, memory-efficient probabilistic data structure. A Bloom filter can tell you if an item is definitely not in a set. You can check the Bloom filter first. If it says the ID definitely doesn’t exist, you can reject the request immediately without even hitting the cache or the database.
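
A minimal sketch of the null-caching idea, assuming a StringRedisTemplate and a hypothetical UserRepository; the sentinel string and the 2-minute TTL are arbitrary choices.

import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class UserLookupService {

    private static final String NULL_SENTINEL = "__NULL__";  // placeholder meaning "known to be missing"

    private final StringRedisTemplate redis;
    private final UserRepository repository;  // hypothetical repository returning JSON strings

    public UserLookupService(StringRedisTemplate redis, UserRepository repository) {
        this.redis = redis;
        this.repository = repository;
    }

    public String findUserJson(String id) {
        String key = "user:" + id;
        String cached = redis.opsForValue().get(key);
        if (NULL_SENTINEL.equals(cached)) {
            return null;                        // known-missing key: do not touch the database
        }
        if (cached != null) {
            return cached;                      // normal cache hit
        }
        String fromDb = repository.findJsonById(id);
        if (fromDb == null) {
            // Cache the "miss" itself with a short TTL so repeated lookups
            // for the same non-existent id stop reaching the database.
            redis.opsForValue().set(key, NULL_SENTINEL, Duration.ofMinutes(2));
            return null;
        }
        redis.opsForValue().set(key, fromDb, Duration.ofMinutes(30));
        return fromDb;
    }
}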
2. Cache Avalanche (Mass Expiration)

  • The “What”: This happens when a massive number of cache keys expire at the exact same time. This is common if you populate your cache on application startup and give every key the same TTL.
  • The Impact: At the moment of expiration, the cache effectively becomes empty. The subsequent wave of user requests all result in cache misses, creating a massive, simultaneous load spike on the database, similar to the “Redis Crashes” scenario.
  • The Solutions:
    1. Randomized TTL (Jitter): This is the simplest and most effective solution. Instead of setting a fixed TTL of 300 seconds, set a TTL of 300 plus a random number of seconds between 0 and 60 (see the sketch after this list). This spreads key expirations over time, smoothing out the load on the database.
    2. High Availability Backend: Ensure your database is scaled (e.g., with read replicas) to handle a temporary surge in traffic.
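
A small sketch of the jitter idea, written against RedisTemplate directly; the class name, base TTL, and jitter window are arbitrary.

import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;

@Component
public class JitteredCacheWriter {

    private static final Duration BASE_TTL = Duration.ofSeconds(300);
    private static final long MAX_JITTER_SECONDS = 60;

    private final StringRedisTemplate redis;

    public JitteredCacheWriter(StringRedisTemplate redis) {
        this.redis = redis;
    }

    // Each key gets 300s plus up to 60s of random jitter, so keys written
    // at the same moment do not all expire at the same moment.
    public void put(String key, String value) {
        long jitter = ThreadLocalRandom.current().nextLong(MAX_JITTER_SECONDS + 1);
        redis.opsForValue().set(key, value, BASE_TTL.plusSeconds(jitter));
    }
}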

3. Cache Stampede (already covered, but reiterated here for context)

  • The “What”: A single, very popular key expires. At that moment, hundreds or thousands of concurrent threads/requests all miss the cache for that same key and “stampede” to the database to regenerate the value.
  • The Impact: You are performing the same expensive computation hundreds of times in parallel, wasting massive amounts of CPU and putting unnecessary load on the database.
  • The Solution:
    1. Synchronized Regeneration: Use a locking mechanism so that only the first thread performs the computation, while all other threads wait. Once the first thread populates the cache, the waiting threads can get the value from there. Spring’s @Cacheable(sync=true) does exactly this (see the sketch below), and it’s the best solution in a Spring environment.
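
A minimal sketch of synchronized regeneration using Spring’s cache abstraction; the cache name, report type, and repository are hypothetical.

import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class ReportService {

    private final ReportRepository repository;  // hypothetical, performs the expensive query

    public ReportService(ReportRepository repository) {
        this.repository = repository;
    }

    // sync = true: when the "reports" entry for this id expires, only one thread
    // recomputes it; concurrent callers block briefly and then read the cached value.
    @Cacheable(cacheNames = "reports", key = "#id", sync = true)
    public Report buildDailyReport(String id) {
        return repository.computeExpensiveReport(id);
    }
}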

A single Redis instance can only hold so much data and handle so much traffic. There are two primary strategies for scaling.

1. Sentinel (for High Availability / Failover)

  • Problem Solved: “My entire dataset fits on one server, but I cannot tolerate downtime if that single server fails.”
  • How it Works: You set up a Master-Replica architecture. The Master handles all writes, and the Replicas asynchronously copy all data from the master. You then run several Sentinel processes.
    • The Sentinels constantly monitor the health of the master.
    • If a majority of Sentinels agree the master is down, they trigger a failover.
    • They elect one of the replicas to become the new master.
    • They reconfigure the other replicas to copy from the new master.
    • They inform your application clients about the new master’s address.
  • Use Case: Critical applications where automatic failover is required, but the dataset size is manageable by a single machine. (See the Spring Boot configuration sketch below.)
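
A sketch of how a Spring Boot client connects through Sentinel instead of a fixed master address, assuming the standard Spring Boot 3 property names; the master name and host addresses are placeholders.

# application.properties -- the client asks the Sentinels for the current master
spring.data.redis.sentinel.master=mymaster
spring.data.redis.sentinel.nodes=10.0.0.1:26379,10.0.0.2:26379,10.0.0.3:26379

After a failover, a Sentinel-aware client learns the new master’s address from the Sentinels rather than from your configuration.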

2. Redis Cluster (for Sharding / Scalability)

  • Problem Solved: “My dataset is too large to fit in one server’s RAM” or “My write traffic is too high for a single CPU core to handle.”
  • How it Works: Redis Cluster provides sharding. The entire keyspace is divided into 16,384 “hash slots” (each key maps to a slot via CRC16(key) mod 16384).
    • Your data is partitioned across multiple Redis master nodes. For example, Master A holds slots 0-5500, Master B holds 5501-11000, etc.
    • When your application wants to operate on a key (e.g., GET user:123), the Redis client hashes the key to determine which slot it belongs to, and sends the command directly to the correct master node.
    • Each master in the cluster can have its own replicas for high availability within that shard.
  • Use Case: Very large datasets (terabytes) or extremely high write throughput that requires horizontal scaling. (See the configuration sketch below.)
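
A sketch of the equivalent cluster-mode client configuration, again with placeholder addresses; listing a few seed nodes is enough, because the client discovers the rest of the topology from them.

# application.properties -- seed nodes for Redis Cluster
spring.data.redis.cluster.nodes=10.0.0.1:6379,10.0.0.2:6379,10.0.0.3:6379
spring.data.redis.cluster.max-redirects=3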

You need to be able to inspect the health and state of your Redis instance.

  • INFO command: Your primary health-check tool. It provides a huge amount of information. Key sections to check:
    • memory: Check used_memory_human to see how full it is and mem_fragmentation_ratio (if > 1.5, it might indicate memory waste).
    • stats: Check keyspace_hits and keyspace_misses to calculate your cache hit ratio (hits / (hits + misses)). A good ratio is typically 95% or higher. (A small hit-ratio sketch follows this list.)
    • persistence: See if RDB snapshots or AOF rewrites are currently running.
  • MONITOR command: A debugging tool that streams back every single command being processed by the Redis server in real-time. Warning: This has a significant performance impact and should NEVER be used on a production server for extended periods. It’s for short-term debugging only.
  • RedisInsight UI: A modern graphical user interface from Redis. It provides a much more user-friendly way to browse keys, view server info from the INFO command, and analyze performance issues. It’s highly recommended for development and operational management.
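
As a rough sketch of turning those INFO fields into a hit-ratio number from inside the application (the component name is incidental; RedisCallback and serverCommands() come from Spring Data Redis):

import java.util.Properties;
import org.springframework.data.redis.core.RedisCallback;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;

@Component
public class RedisHitRatioProbe {

    private final StringRedisTemplate redis;

    public RedisHitRatioProbe(StringRedisTemplate redis) {
        this.redis = redis;
    }

    // Reads the "stats" section of INFO and computes hits / (hits + misses).
    public double hitRatio() {
        Properties stats = redis.execute(
                (RedisCallback<Properties>) connection -> connection.serverCommands().info("stats"));
        if (stats == null) {
            return 0.0;
        }
        long hits = Long.parseLong(stats.getProperty("keyspace_hits", "0"));
        long misses = Long.parseLong(stats.getProperty("keyspace_misses", "0"));
        long total = hits + misses;
        return total == 0 ? 1.0 : (double) hits / total;
    }
}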

By default, Redis requires no password and does not encrypt traffic. At a minimum, take these basic steps.

  • Authentication:

    • The “What”: Protecting your Redis server with a password.
    • How: In the redis.conf file, set the requirepass directive to a long, complex password. After this is set, clients must issue the AUTH <password> command before they can run any other commands.
    • Spring Boot Config: In application.properties, set spring.data.redis.password=your-secret-password. In production, this secret should be managed by a proper secrets management tool (like HashiCorp Vault, AWS Secrets Manager, etc.) and not be in plain text.
  • SSL/TLS (Encryption in Transit):

    • The “What”: Encrypting the network traffic between your Spring Boot application and the Redis server.
    • Why: If not enabled, anyone who can monitor the network can see all your cached data (including sensitive user session information) in plain text. This is essential when connecting to Redis over a non-trusted network (e.g., across the public internet to a cloud-hosted Redis instance).
    • Spring Boot Config: In application.properties, simply set spring.data.redis.ssl.enabled=true. The Redis client (Lettuce) will then handle the SSL handshake. You may need to configure truststores/keystores if you are using self-signed certificates. (A combined configuration sketch follows.)
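
A combined sketch of both settings as they might appear in application.properties; the environment-variable indirection stands in for whatever secrets manager you use and is an assumption, not a requirement.

# application.properties -- authentication and encryption in transit
spring.data.redis.password=${REDIS_PASSWORD}
spring.data.redis.ssl.enabled=true
# For self-signed certificates, point the Lettuce client at a truststore
# (for example via Spring Boot SSL bundles) instead of disabling verification.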