Abstract
At Netflix scale - billions of requests per day and petabytes of key-value data - even small inefficiencies in storage and network paths become expensive. This talk shares how we reduced the footprint of both keys and values through efficient binary tuple encoding and dynamic dictionary-based compression for a high-throughput, storage-agnostic key-value platform, and what we learned when database realities met compression theory.
We will walk through the engineering journey from prototype to production-minded rollout: selecting training data from live workloads, comparing compression strategies, and balancing quality against operational cost. Keys, which must remain ordered and stable, and Values, which vary widely in format and size, require different storage approaches. Those approaches are also data-dependent, so we will show how data shape directly influences encoding effectiveness and where naive approaches fail.
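To make the "ordered and stable" requirement for Keys concrete, here is a minimal sketch of an order-preserving binary tuple encoding. It is an illustration under assumptions, not the platform's actual format: the type tags, the `encode_key` helper, and the restriction to strings and 64-bit integers are all hypothetical. The property it demonstrates is that comparing encoded keys byte by byte gives the same order as comparing the original tuples.

```python
import struct

# Hypothetical type tags; a real format would define its own.
TAG_STR = b"\x01"
TAG_INT = b"\x02"


def encode_component(value):
    """Encode one key component so that byte order matches value order."""
    if isinstance(value, str):
        raw = value.encode("utf-8")
        # Escape embedded NUL bytes, then terminate with a single NUL so
        # that a shorter string always sorts before its extensions.
        return TAG_STR + raw.replace(b"\x00", b"\x00\xff") + b"\x00"
    if isinstance(value, int):
        if not -(1 << 63) <= value < (1 << 63):
            raise ValueError("only signed 64-bit integers are supported")
        # Shift into the unsigned range so negatives sort before positives,
        # then store big-endian so byte order equals numeric order.
        return TAG_INT + struct.pack(">Q", value + (1 << 63))
    raise TypeError(f"unsupported key component: {type(value).__name__}")


def encode_key(parts):
    """Encode a tuple of components into one byte string."""
    return b"".join(encode_component(p) for p in parts)


keys = [
    ("user", 10),
    ("user", -3),
    ("us", 99),
    ("user\x00x", 0),
    ("user", 2),
    ("video", -1),
]

# Sorting by encoded bytes yields the same order as sorting the tuples.
by_bytes = sorted(keys, key=encode_key)
```

Because the encoding is deterministic and has no per-key metadata, the same logical key always maps to the same bytes, which is what keeps keys stable across readers and writers.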
Most importantly, we will focus on system outcomes beyond compression ratio and on how engineers gain trust in these approaches through rigorous verification. We will examine the impact on database/storage footprint, compaction and cache behavior, network IO, and p99 latency guardrails. We will also cover reliability patterns required in real systems: synthetic verification, simulation testing, dictionary versioning, compatibility/fallback paths, safe rollout controls, and failure handling when training signals are noisy or incomplete.
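The dictionary versioning and fallback patterns listed above can be sketched as follows. This is an illustrative example under assumptions, not the production design: it uses zlib preset dictionaries from the Python standard library as a stand-in for whatever codec the platform uses, and the `DICTIONARIES` registry, the one-byte version header, and both function names are hypothetical.

```python
import zlib

RAW = 0  # Hypothetical header value meaning "stored uncompressed".

# Hypothetical registry of dictionary versions. Old versions stay
# registered so values written before a rollout remain readable.
DICTIONARIES = {
    1: b'{"title": "", "genre": "", "maturity": ""}',
    2: b'{"title": "", "genre": "", "maturity": "", "locale": "en-US"}',
}


def compress_value(value: bytes, version: int) -> bytes:
    """Compress with the given dictionary version, or fall back to raw."""
    zdict = DICTIONARIES.get(version)
    if zdict is not None:
        compressor = zlib.compressobj(level=6, zdict=zdict)
        packed = compressor.compress(value) + compressor.flush()
        # Only keep the compressed form when it actually saves space.
        if len(packed) < len(value):
            return bytes([version]) + packed
    # Fallback path: unknown dictionary or no gain, so store the raw bytes.
    return bytes([RAW]) + value


def decompress_value(blob: bytes) -> bytes:
    """Read the version header and decode with the matching dictionary."""
    version, payload = blob[0], blob[1:]
    if version == RAW:
        return payload
    zdict = DICTIONARIES.get(version)
    if zdict is None:
        raise KeyError(f"unknown dictionary version {version}")
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(payload) + decompressor.flush()


value = b'{"title": "Example", "genre": "drama", "maturity": "PG"}'
old_blob = compress_value(value, version=1)   # written before a rollout
new_blob = compress_value(value, version=2)   # written after a rollout
```

Tagging every stored value with the dictionary version that produced it means a new dictionary can be rolled out, or rolled back, without rewriting existing data, and a writer that lacks a usable dictionary degrades to storing raw bytes instead of failing.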
Attendees will leave with a practical framework for applying optimal encoding techniques in distributed storage systems: how to choose training pipelines, what signals to monitor, and how to get measurable efficiency gains without sacrificing latency or correctness.
What you will learn:
- Techniques for encoding both Keys and Values efficiently; they require different approaches.
- How to evaluate compression strategies using database/storage metrics (not just compression ratio), including footprint, IO, cache behavior, and tail latency.
- How workload characteristics (value sizes, churn, hot-key skew) should drive training-sample strategy and dictionary lifecycle decisions.
- How to design safe production rollouts with versioning, compatibility/fallback paths, observability, and fast rollback controls.
- How to build a repeatable, high-confidence verification approach to compare training pipelines and make evidence-based trade-offs between efficiency and latency.
Speaker
Joseph Lynch
Principal Software Engineer @Netflix Building Highly-Reliable and High-Leverage Infrastructure Across Stateless and Stateful Services
Joseph Lynch is a Principal Software Engineer at Netflix who focuses on building highly-reliable and high-leverage infrastructure across our stateless and stateful services. He led the shift of the Netflix data tier to abstraction, driving resilience through a Data Gateway architecture. He loves building distributed systems and learning the fun and exciting ways that they scale, operate, and break. Having wrangled many large-scale distributed systems over the years, he currently spends much of his time building automated verification and fast deployment of every flavor of service at Netflix.
Speaker
Ayushi Singh
Senior Software Engineer @Netflix Specializing in Large-Scale Distributed Data Systems
Ayushi is a Senior Software Engineer specializing in large-scale distributed data systems. With deep expertise in Key-Value and Apache Cassandra platforms, she focuses on building resilient, secure-by-default infrastructure capable of handling massive production traffic. Ayushi has a proven track record of leading complex data engineering initiatives, including capacity planning for live workloads, hardening database deployments, and optimizing data repair and consistency workflows. Passionate about system reliability, she has successfully led seamless, zero-downtime database migrations for critical systems serving millions of requests per second.