State management is the unglamorous linchpin of serverless applications. A function that can't remember what happened five milliseconds ago isn't useful for much beyond trivial data transformation. Yet many teams adopt serverless for its scaling promises and then discover that the same scaling properties make state management harder. This guide establishes a qualitative benchmark—a way to think about and evaluate serverless state management without relying on fabricated metrics. We'll focus on what actually breaks, what works, and how to choose between trade-offs.
Why State Management in Serverless Is a Different Beast
In traditional server-based applications, state lives in memory on the same machine that runs your code. You have a process, a heap, a database connection pool—everything is cozy and local. Serverless functions are ephemeral. They start cold, run for a few seconds or minutes, and then disappear. The next invocation might land on a completely different container. This means any state that outlives a single request must be stored externally.
The problem is that external storage adds latency, cost, and consistency headaches. DynamoDB advertises single-digit-millisecond reads, but the round trip seen from a function can stretch to tens of milliseconds under load, which is fine for one request but painful when a workflow needs to read and write state five times. Worse, distributed state introduces races: two invocations of the same function can try to update the same record simultaneously. Without proper locking or conditional writes, one update silently overwrites the other and you get corrupted data.
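The race described above can be reproduced without any cloud service. The following is a minimal sketch using an in-memory stand-in for an external store; the class and method names (`VersionedStore`, `put_if_version`) are hypothetical and only illustrate the difference between a blind write and a conditional write.

```python
class VersionedStore:
    """In-memory stand-in for an external store; each record carries a version."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        # Returns (value, version); missing keys read as (0, 0).
        return self._items.get(key, (0, 0))

    def put(self, key, value):
        # Blind write: overwrites whatever is there.
        _, version = self.get(key)
        self._items[key] = (value, version + 1)

    def put_if_version(self, key, value, expected_version):
        # Conditional write: succeeds only if nobody wrote since we read.
        _, version = self.get(key)
        if version != expected_version:
            return False
        self._items[key] = (value, version + 1)
        return True


# Blind writes: both invocations read the balance before either one writes.
naive = VersionedStore()
naive.put("balance", 100)
a_value, _ = naive.get("balance")
b_value, _ = naive.get("balance")
naive.put("balance", a_value - 30)  # invocation A
naive.put("balance", b_value - 50)  # invocation B overwrites A's debit
naive_result = naive.get("balance")[0]

# Conditional writes: the second writer is rejected and must re-read.
guarded = VersionedStore()
guarded.put("balance", 100)
a_value, a_version = guarded.get("balance")
b_value, b_version = guarded.get("balance")
a_ok = guarded.put_if_version("balance", a_value - 30, a_version)
b_ok = guarded.put_if_version("balance", b_value - 50, b_version)
guarded_result = guarded.get("balance")[0]
```

With blind writes the first debit disappears; with the conditional write the second invocation learns that its view was stale and can retry from a fresh read.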
Teams often fall into the trap of treating serverless state management as an afterthought. They write code that assumes a single-threaded, always-on process and then wonder why their order-processing pipeline double-charges customers. The first step in any qualitative benchmark is acknowledging that serverless state is not just 'state moved to a database'—it's a distributed systems problem in disguise.
What Goes Wrong Without a Strategy
Without deliberate design, three common failures emerge: idempotency violations, lost intermediate state, and runaway costs. Idempotency violations happen when a retry causes the same operation to execute twice. For example, a payment function that checks if a transaction exists might miss the check because the first write hasn't replicated yet. Lost intermediate state occurs when a multi-step workflow crashes midway and there's no durable record of progress. Cost blowups come from over-fetching data or polling databases when a simpler event-driven approach would work.
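One common way to guard against the retry problem is an idempotency key that is claimed before the side effect runs. The sketch below is an in-memory illustration only; `IdempotencyStore`, `claim`, and `charge_once` are hypothetical names, and in a real store the claim would have to be a single atomic put-if-absent rather than a separate check and write.

```python
class IdempotencyStore:
    """Models an atomic put-if-absent on a request ID."""

    def __init__(self):
        self._seen = {}

    def claim(self, request_id):
        # Return True only for the first caller with this request ID.
        if request_id in self._seen:
            return False
        self._seen[request_id] = "in_progress"
        return True


charges = []  # stands in for the external payment side effect


def charge_once(store, request_id, amount_cents):
    if not store.claim(request_id):
        return "duplicate_ignored"
    charges.append((request_id, amount_cents))
    return "charged"


store = IdempotencyStore()
first = charge_once(store, "req-1", 500)
retry = charge_once(store, "req-1", 500)  # same request delivered again
```

The design choice here is to key the claim on the caller's request ID rather than on the payload, so a retry with identical content is still recognized as the same operation.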
Who This Benchmark Is For
This guide is for developers and architects who are building serverless workflows—think order processing, data pipelines, or user onboarding flows. If you're using AWS Lambda, Azure Functions, or Google Cloud Functions and you've hit a wall with state, this qualitative framework will help you evaluate options. We won't give you a single answer because there isn't one. Instead, we'll give you the criteria to make your own decision.
What to Settle Before You Choose a State Strategy
Before evaluating any specific tool or pattern, you need to clarify three things: your consistency requirements, your latency budget, and your cost ceiling. These are not technical trivia—they are the lens through which every state solution looks different.
Consistency: Strong or Eventual?
Strong consistency means every read sees the latest write. Eventual consistency means reads might return stale data for a short window. In serverless, strong consistency often forces you to use a single-writer pattern or a database like DynamoDB with strongly consistent reads (which cost more and can have higher latency). Eventual consistency is cheaper and faster but requires your application to tolerate stale data. A contrasting pair: a leaderboard can easily use eventual consistency; a payment ledger cannot.
Many teams default to eventual consistency because it's easier—and then get burned when a user sees an outdated balance. Our benchmark asks: can your business logic handle a few seconds of staleness? If yes, eventual is fine. If not, you need a design that enforces strong consistency, possibly by routing all writes for a given entity through a single function instance.
Latency Budget: How Fast Can State Be Read and Written?
Serverless functions have a hard timeout (at most 15 minutes on AWS Lambda), but individual operations need to be far faster. If your workflow reads state from a key-value store and then writes it back, each 20 ms round trip adds up. For a workflow with ten sequential state operations, that's 200 ms in I/O alone. For real-time applications like gaming or chat, that's too slow. For batch processing, it's fine.
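The arithmetic above is simple enough to keep as a small helper while comparing designs. This is a sketch only: the function names are hypothetical, and it assumes the state operations run sequentially with a uniform round-trip time.

```python
def io_latency_ms(operations, round_trip_ms):
    """Total I/O time for sequential state operations."""
    return operations * round_trip_ms


def fits_budget(operations, round_trip_ms, budget_ms):
    """True when state I/O alone stays within the latency budget."""
    return io_latency_ms(operations, round_trip_ms) <= budget_ms


checkout_io = io_latency_ms(10, 20)        # ten operations at 20 ms each
checkout_ok = fits_budget(10, 20, 100)     # against an assumed 100 ms budget
batch_ok = fits_budget(10, 20, 60_000)     # a batch job with a generous budget
```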
We've seen teams over-engineer state for latency requirements that don't exist. A nightly data export doesn't need microsecond consistency. A user-facing checkout flow does. Set your latency budget before you pick a tool—don't let the tool dictate your architecture.
Cost Ceiling: What Are You Willing to Pay per Million Invocations?
Serverless state management costs come from three sources: storage (per GB-month), read/write operations (per request), and data transfer (per GB). DynamoDB charges for read and write capacity units, or per request in on-demand mode. Redis (via ElastiCache or managed services) charges per node hour. Queue services like SQS charge per request. Request-based charges scale roughly linearly with usage, so a poorly designed state strategy can bankrupt a startup.
A qualitative benchmark doesn't give you dollar amounts, but it forces you to think about cost-to-value. For high-throughput systems, batching writes or using a write-behind cache can reduce costs. For low-throughput systems, a simple S3 bucket with versioning might be cheaper than a database. Always project costs for your expected load—and then double it.
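A cost projection along these lines can be a few lines of arithmetic. The sketch below is illustrative: every price is a caller-supplied placeholder rather than a real tariff, and the function name and parameters are assumptions made for this example.

```python
def monthly_state_cost(
    invocations,
    reads_per_invocation,
    writes_per_invocation,
    price_per_million_reads,
    price_per_million_writes,
    storage_gb,
    price_per_gb_month,
):
    """Rough monthly cost of a state layer; all prices are placeholders."""
    reads = invocations * reads_per_invocation
    writes = invocations * writes_per_invocation
    request_cost = (
        reads / 1_000_000 * price_per_million_reads
        + writes / 1_000_000 * price_per_million_writes
    )
    return request_cost + storage_gb * price_per_gb_month


# Placeholder prices chosen for easy arithmetic, not taken from any price list.
expected = monthly_state_cost(1_000_000, 2, 1, 1.0, 2.0, 10, 0.5)
doubled = monthly_state_cost(2_000_000, 2, 1, 1.0, 2.0, 10, 0.5)
```

Running the projection at the expected load and again at twice that load shows how much of the bill is request-driven versus fixed storage.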
A Qualitative Benchmark for Evaluating State Strategies
With the prerequisites clear, we can now define a benchmark. This is not a numeric score but a set of questions you ask about each candidate approach. We'll apply it to three common strategies: external key-value store (DynamoDB), in-memory cache with persistence (Redis), and event-driven orchestration (Step Functions + S3).
Criterion 1: Idempotency Handling
Can the strategy naturally prevent duplicate writes? DynamoDB's condition expressions let you write only if a version number matches. Redis can use Lua scripts for atomic compare-and-swap. Step Functions Standard workflows treat a StartExecution call that repeats the name and input of a running execution as idempotent, but a retried task still re-runs your function, so the task itself must tolerate duplicates. Evaluate each approach: does it require extra code to be idempotent, or is it built in?
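For the DynamoDB case, a version-checked write can be expressed as the arguments to the low-level `update_item` call. The sketch below only builds the argument dictionary so it runs without AWS access; the table and attribute names (`workflow_id`, `status`, `version`) are assumptions made for illustration.

```python
def versioned_update_kwargs(table_name, workflow_id, new_status, expected_version):
    """Arguments for a DynamoDB update that succeeds only if the version matches."""
    return {
        "TableName": table_name,
        "Key": {"workflow_id": {"S": workflow_id}},
        # Name aliases (#status, #version) avoid clashes with reserved words.
        "UpdateExpression": "SET #status = :status, #version = :next",
        "ConditionExpression": "#version = :expected",
        "ExpressionAttributeNames": {"#status": "status", "#version": "version"},
        "ExpressionAttributeValues": {
            ":status": {"S": new_status},
            ":expected": {"N": str(expected_version)},
            ":next": {"N": str(expected_version + 1)},
        },
    }


kwargs = versioned_update_kwargs("workflow-state", "order-42", "CHARGED", 3)
```

With boto3 installed and credentials configured, these would be passed as `boto3.client("dynamodb").update_item(**kwargs)`; when the stored version differs, the call fails with a `ConditionalCheckFailedException` error code, which the caller can treat as "someone else got there first".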
Criterion 2: Recovery from Partial Failure
What happens when a function crashes mid-workflow? With DynamoDB, you can store progress markers and resume from the last known step. With Redis, if the cache is cleared, you lose progress unless you also persist to a database. Step Functions retry failed steps according to the Retry rules you declare and keep execution history for a limited retention period (90 days after a Standard execution closes, at the time of writing), so export anything you need for longer. Recovery is often the weakest point of cache-only strategies.
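The progress-marker idea can be sketched with a plain dictionary standing in for the durable store. The names below (`run_workflow`, `checkpoints`, `fail_at`) are hypothetical, and the crash is simulated with an exception.

```python
def run_workflow(steps, checkpoints, workflow_id, fail_at=None):
    """Run steps in order, skipping any step already recorded as done."""
    done = checkpoints.setdefault(workflow_id, [])
    for name, action in steps:
        if name in done:
            continue  # completed in an earlier attempt
        if name == fail_at:
            raise RuntimeError(f"simulated crash in {name}")
        action()
        done.append(name)  # durable progress marker
    return done


executed = []
steps = [
    (name, lambda name=name: executed.append(name))
    for name in ("charge", "reserve", "email")
]
checkpoints = {}  # stands in for a durable table keyed by workflow ID

try:
    run_workflow(steps, checkpoints, "order-1", fail_at="reserve")
except RuntimeError:
    pass  # the platform would retry the invocation here

run_workflow(steps, checkpoints, "order-1")  # resumes after "charge"
```

A crash between running a step and recording its marker would still replay that step on resume, so this pattern complements idempotent steps rather than replacing them.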
Criterion 3: Concurrency and Locking
Can two invocations safely update the same state? DynamoDB's optimistic locking with version numbers works well for low contention. Redis does not ship a lock primitive; distributed locks such as Redlock are algorithms implemented in client libraries, and they require careful implementation. Step Functions reduce the problem by giving each execution its own isolated state and running its steps in a defined order, though shared records touched by different executions still need conditional writes. For high-contention scenarios, that serialization might become a bottleneck.
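Optimistic locking usually comes paired with a retry loop: read, attempt a version-checked write, and re-read on conflict. The sketch below simulates this with threads and an in-memory store; `CasStore` and `increment_with_retry` are hypothetical names, and the internal lock only models the atomicity a real store would provide for a conditional write.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CasStore:
    """Single-record store with an atomic version-checked write."""

    def __init__(self):
        self._lock = threading.Lock()  # models the store's own atomicity
        self._value = 0
        self._version = 0

    def read(self):
        with self._lock:
            return self._value, self._version

    def write_if_version(self, value, expected_version):
        with self._lock:
            if self._version != expected_version:
                return False
            self._value = value
            self._version += 1
            return True


def increment_with_retry(store, max_attempts=1000):
    """Read-modify-write that retries on version conflicts."""
    for attempt in range(1, max_attempts + 1):
        value, version = store.read()
        if store.write_if_version(value + 1, version):
            return attempt
    raise RuntimeError("too much contention")


store = CasStore()
with ThreadPoolExecutor(max_workers=8) as pool:
    attempts = list(pool.map(lambda _: increment_with_retry(store), range(200)))
final_value, final_version = store.read()
```

Comparing the total number of attempts with the number of successful increments gives a rough feel for how much work contention wastes.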
Criterion 4: Operational Overhead
How much maintenance does the state layer require? DynamoDB is fully managed—no servers to patch. Redis (self-managed) requires cluster management, monitoring, and failover planning. Step Functions are managed but have a learning curve and limited debugging tools. The benchmark should weigh operational cost against flexibility.
Applying the Benchmark to a Composite Scenario
Consider an order processing system that takes an order, charges the customer, updates inventory, and sends a confirmation email. A team might start with DynamoDB for everything: store order state, check inventory, update counts. Under low load, this works. Under high load, writes to the inventory item become contended. The team then adds a Redis cache for inventory counts, but now they have to handle cache misses and eventual consistency. Eventually, they move the orchestration to Step Functions, using DynamoDB only for the final order record. The benchmark helps them see that the first approach fails on concurrency, the second fails on recovery, and the third requires more initial setup but handles both well.
Tools and Setup for Qualitative State Evaluation
You don't need a lab to run this benchmark. Start with a simple test: implement a two-step workflow (read, modify, write) with each candidate approach. Measure not just latency but also error rates under simulated retries and concurrent requests. The setup is minimal—a few Lambda functions and a test harness that fires events at increasing concurrency.
DynamoDB as a State Store
Setup is straightforward: create a table with a partition key (e.g., workflow ID) and a sort key for steps. Use conditional writes for idempotency. The main pitfalls are hot partitions (when a few partition key values receive most of the traffic) and cost from read/write capacity. We recommend on-demand capacity for variable workloads. Enable TTL to expire old state automatically, and test it, keeping in mind that expired items are removed in the background rather than at the exact expiry time.
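The table layout described above maps onto the arguments of the DynamoDB `create_table` and `update_time_to_live` calls. The sketch below only constructs those argument dictionaries so it runs offline; the attribute names (`workflow_id`, `step`, `expires_at`) are assumptions made for this example.

```python
def state_table_kwargs(table_name):
    """Arguments for create_table: workflow ID as partition key, step as sort key."""
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": "workflow_id", "AttributeType": "S"},
            {"AttributeName": "step", "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "workflow_id", "KeyType": "HASH"},
            {"AttributeName": "step", "KeyType": "RANGE"},
        ],
        "BillingMode": "PAY_PER_REQUEST",  # on-demand capacity
    }


def ttl_kwargs(table_name, attribute="expires_at"):
    """Arguments for update_time_to_live on the same table."""
    return {
        "TableName": table_name,
        "TimeToLiveSpecification": {"Enabled": True, "AttributeName": attribute},
    }


def expires_at(now_epoch_seconds, days):
    """TTL values are epoch seconds stored in a Number attribute."""
    return int(now_epoch_seconds) + days * 24 * 60 * 60


table = state_table_kwargs("workflow-state")
ttl = ttl_kwargs("workflow-state")
one_week = expires_at(1_700_000_000, 7)
```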
Redis as a State Cache with Persistence
Setup requires a Redis cluster (ElastiCache or a managed service). Use Redis streams or sorted sets to store ordered state. Enable AOF persistence for durability where your deployment supports it, and check which failover modes it is compatible with. The main advantage is low latency (sub-millisecond), but you must handle cache warming on cold starts and data loss on failover. For the benchmark, simulate a node failure and see how much state survives.
Step Functions for Orchestration
Step Functions natively manage workflow state. Each execution has an input, output, and history. You don't need to store intermediate state yourself—the service does it. Setup is declarative (Amazon States Language). The pitfalls are cost per state transition and the 256 KB payload limit. For the benchmark, test a workflow with 50 steps and see if it fits your budget.
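To make the declarative setup concrete, here is a sketch that generates a linear Amazon States Language definition and checks a payload against the 256 KB limit. The helper names and the Lambda ARN format are placeholders for illustration; the retry settings are arbitrary example values.

```python
import json

PAYLOAD_LIMIT_BYTES = 256 * 1024


def linear_state_machine(step_names, arn_for):
    """Build a definition where each Task state hands off to the next."""
    states = {}
    for index, name in enumerate(step_names):
        state = {
            "Type": "Task",
            "Resource": arn_for(name),
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
        }
        if index + 1 < len(step_names):
            state["Next"] = step_names[index + 1]
        else:
            state["End"] = True
        states[name] = state
    return {"StartAt": step_names[0], "States": states}


def payload_fits(payload):
    """True when the JSON-encoded payload is within the size limit."""
    return len(json.dumps(payload).encode("utf-8")) <= PAYLOAD_LIMIT_BYTES


def placeholder_arn(name):
    # Hypothetical account ID and region, for illustration only.
    return f"arn:aws:lambda:us-east-1:123456789012:function:{name}"


definition = linear_state_machine(["Charge", "ReserveInventory", "SendEmail"], placeholder_arn)
small_ok = payload_fits({"order_id": "order-42"})
large_ok = payload_fits({"blob": "x" * (300 * 1024)})
```

When a payload does not fit, a common workaround is to store the data elsewhere (for example in S3) and pass only a reference between states.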
Queue-Based State (SQS + Lambda)
For simple state, you can encode progress in queue messages. A function processes a message, updates state in DynamoDB, and sends a new message for the next step. This works for linear workflows but becomes hard to manage for branching or fan-out. Setup is simple—just queues and functions—but debugging requires tracing message flows.
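The message-carries-progress pattern can be sketched with an in-memory queue. Everything here is a stand-in: `deque` plays the role of the queue, a dictionary plays the role of the DynamoDB table, and the step names are hypothetical.

```python
from collections import deque

STEPS = ["charge", "reserve_inventory", "send_email"]


def handle(message, queue, state):
    """Process one step, record it, and enqueue a message for the next step."""
    step = STEPS[message["step_index"]]
    state.setdefault(message["order_id"], []).append(step)
    next_index = message["step_index"] + 1
    if next_index < len(STEPS):
        queue.append({"order_id": message["order_id"], "step_index": next_index})


def drain(queue, state):
    """Stand-in for the platform invoking the function once per message."""
    while queue:
        handle(queue.popleft(), queue, state)


queue = deque([{"order_id": "order-1", "step_index": 0}])
state = {}
drain(queue, state)
```

Because most queue services deliver at least once, a real handler would also need the idempotency guard discussed earlier before recording a step.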
Variations for Different Constraints
Not every workload needs the full benchmark. Here are variations for common constraints.
High Throughput, Low Consistency (e.g., Log Aggregation)
Use a fire-and-forget approach: write state to S3 or a log stream. Don't worry about idempotency—duplicates are fine. Redis with a simple key-value pattern works, but consider using Kinesis Firehose to batch writes. The benchmark here focuses on cost and write speed, not correctness.
Strict Consistency, Low Throughput (e.g., Financial Transactions)
Use DynamoDB with strongly consistent reads and conditional writes. Consider adding a lock table or using DynamoDB transactions (which are atomic but slower and consume more capacity per write). Step Functions with a serial execution pattern also work. The benchmark prioritizes idempotency and recovery over latency.
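As an illustration of the transactional option, the sketch below builds the arguments for a DynamoDB `transact_write_items` call that records a payment and debits a balance atomically, plus a strongly consistent `get_item`. It constructs dictionaries only, so it runs offline; the table names, attribute names, and the choice of the payment ID as the request token are assumptions (the token must be short enough to fit the API's length limit).

```python
def record_payment_kwargs(ledger_table, accounts_table, payment_id, account_id, amount_cents):
    """Arguments for an all-or-nothing ledger insert plus balance debit."""
    amount = {"N": str(amount_cents)}
    return {
        "ClientRequestToken": payment_id,  # makes a repeated call idempotent
        "TransactItems": [
            {
                "Put": {
                    "TableName": ledger_table,
                    "Item": {
                        "payment_id": {"S": payment_id},
                        "account_id": {"S": account_id},
                        "amount_cents": amount,
                    },
                    # Reject the whole transaction if the payment already exists.
                    "ConditionExpression": "attribute_not_exists(payment_id)",
                }
            },
            {
                "Update": {
                    "TableName": accounts_table,
                    "Key": {"account_id": {"S": account_id}},
                    "UpdateExpression": "SET balance_cents = balance_cents - :amt",
                    # Reject the whole transaction on insufficient funds.
                    "ConditionExpression": "balance_cents >= :amt",
                    "ExpressionAttributeValues": {":amt": amount},
                }
            },
        ],
    }


def consistent_read_kwargs(accounts_table, account_id):
    """Arguments for a strongly consistent get_item."""
    return {
        "TableName": accounts_table,
        "Key": {"account_id": {"S": account_id}},
        "ConsistentRead": True,
    }


tx = record_payment_kwargs("ledger", "accounts", "pay-001", "acct-9", 2500)
read = consistent_read_kwargs("accounts", "acct-9")
```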
Long-Running Workflows with Human Steps (e.g., Approval Flows)
Step Functions are the natural fit because they can pause for days. Store the execution ARN in a database for lookup. Avoid storing state in short-lived Redis since the workflow may outlive the cache. The benchmark here tests how well the system handles timeouts and resumptions.
Multi-Region State (e.g., Global User Sessions)
Use DynamoDB Global Tables or a cross-region Redis cluster. Both have trade-offs: Global Tables are eventually consistent across regions; Redis requires active-active setup with conflict resolution. The benchmark should measure read latency from different regions and conflict rates.
When State Management Breaks: Pitfalls and Debugging
Even with a good strategy, things go wrong. Here are the most common failures we've seen and how to diagnose them.
Idempotency Failures
You see duplicate charges or duplicate emails. Check if your state store uses conditional writes. For DynamoDB, ensure you're using version numbers or a unique request ID as the sort key. For Redis, check that the existence check and the write happen inside a single Lua script or transaction; a script runs atomically, but a check made in one call and a write made in another does not. Debug by enabling function logging and tracing, and look for multiple invocations with the same ID.
Hot Partitions
One partition (or shard) receives most of the writes. In DynamoDB, this shows as throttled requests. Fix by using a composite key with a random suffix or by distributing load across partitions. In Redis, hot keys can be mitigated by splitting the key into multiple keys that land in different slots. Note that a cluster hash tag does the opposite: it pins related keys to a single slot, so avoid it for hot data.
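The random-suffix fix is often called write sharding: writes are spread over several keys and reads sum across them. The sketch below uses a dictionary as the store; the key format and helper names are hypothetical.

```python
import random


def sharded_key(base_key, shard_count, rng):
    """Pick one of shard_count keys at random for a write."""
    return f"{base_key}#{rng.randrange(shard_count)}"


def all_shard_keys(base_key, shard_count):
    return [f"{base_key}#{index}" for index in range(shard_count)]


def read_total(store, base_key, shard_count):
    """Reads must fan out over every shard and combine the results."""
    return sum(store.get(key, 0) for key in all_shard_keys(base_key, shard_count))


rng = random.Random(0)  # seeded only to make the example repeatable
store = {}
for _ in range(1000):
    key = sharded_key("inventory:sku-1", 10, rng)
    store[key] = store.get(key, 0) + 1

total = read_total(store, "inventory:sku-1", 10)
```

The trade-off is visible in `read_total`: writes get cheaper and less contended, while every read now costs one lookup per shard.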
Cold Start Latency
When a function starts cold, it has to open a new connection to the state store. For Redis, establishing that connection can add 100+ ms. Create the client once at module scope so that warm invocations reuse it; for relational databases, put a connection proxy such as PgBouncer in front. For DynamoDB, the SDK handles retries, but cold starts still add latency. Consider using provisioned concurrency for latency-sensitive functions.
Cost Explosion
You get a bill that's ten times higher than expected. Common causes: polling a database instead of using events, reading entire records when only a subset is needed, or forgetting to set TTL on old state. Audit your read/write patterns. Use CloudWatch metrics to identify the most expensive operations. Often, switching to a smaller payload or batching writes cuts costs dramatically.
State Drift in Step Functions
You see executions stuck in 'Running' state. This typically happens when a task is waiting for a callback that never arrives and no timeout is set. Set a task timeout and a heartbeat timeout, and catch the resulting timeout error with a fallback state that marks the execution as failed. Set a maximum execution age (e.g., 30 days) to clean up stale executions. Monitor the Step Functions execution history for patterns.
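A callback task with timeouts might look like the following Amazon States Language definition, written here as a Python dictionary so it can be serialized and inspected. This is a sketch under assumptions: the function name, state names, error label, and all durations are placeholders.

```python
import json

THIRTY_DAYS_SECONDS = 30 * 24 * 60 * 60

definition = {
    "StartAt": "WaitForApproval",
    "TimeoutSeconds": THIRTY_DAYS_SECONDS,  # maximum age of the whole execution
    "States": {
        "WaitForApproval": {
            "Type": "Task",
            # Callback pattern: the task pauses until the token is returned.
            "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
            "Parameters": {
                "FunctionName": "request-approval",  # hypothetical function
                "Payload": {"taskToken.$": "$$.Task.Token"},
            },
            "HeartbeatSeconds": 300,   # fail if no heartbeat for 5 minutes
            "TimeoutSeconds": 86400,   # fail if no result within a day
            "Catch": [
                {
                    "ErrorEquals": ["States.Timeout", "States.HeartbeatTimeout"],
                    "Next": "MarkFailed",
                }
            ],
            "End": True,
        },
        "MarkFailed": {
            "Type": "Fail",
            "Error": "CallbackTimedOut",
            "Cause": "No callback or heartbeat arrived in time",
        },
    },
}

definition_json = json.dumps(definition, indent=2)
task = definition["States"]["WaitForApproval"]
```

The heartbeat interval has to be shorter than the task timeout for it to be useful, and the worker holding the token is expected to send heartbeats while it is still making progress.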
Next Steps: Build Your Own Qualitative Benchmark
This guide gave you a framework, not a prescription. The next step is to apply it to your specific workflow. Start by listing your state operations and categorizing them by consistency, latency, and cost needs. Then, prototype two or three approaches using the benchmark criteria. Run a load test with simulated failures—network partitions, function timeouts, concurrent writes. Document what breaks and how much effort it takes to fix.
Finally, share your findings with your team. The qualitative benchmark is a conversation starter, not a checkbox. It helps you make explicit trade-offs that are often implicit. Over time, you'll develop a sense for which patterns work for which problems. That intuition is more valuable than any single tool.
If you're starting from scratch, begin with a simple DynamoDB-based state store and then evolve as you hit limitations. Avoid premature optimization. Most serverless workflows don't need Redis or Step Functions until they do—and when they do, the benchmark will tell you.