"Can you design Twitter's feed?" The system design interviewer slides the question across during your Fortune 500 technical interview. Half an hour later, you've sketched a basic three-tier architecture any junior engineer could draw, and you already know the offer isn't coming.
System design interviews separate senior engineers from staff+ candidates. You know how to build systems at work. Explaining your architectural thinking clearly while someone probes every decision is entirely different.
What makes this hard: unlike coding questions with right answers, system design is deliberately open-ended. Interviewers evaluate your thought process, not a final diagram. They're testing whether you ask the right questions, make reasonable trade-offs, and think at an organizational scale. The framework matters as much as the solution.
This guide breaks down practical system design questions that appear in senior+ loops, reveals what interviewers actually evaluate, shows approaches that fail, and provides frameworks demonstrating staff-level architectural thinking.
1. Design a URL Shortener Like bit.ly
URL shortener questions test fundamental distributed systems thinking — unique ID generation, database design, caching strategy, and redirect handling at scale. Interviewers probe whether you understand read-heavy versus write-heavy systems, can estimate capacity requirements, and make pragmatic trade-offs between consistency and performance.
Specifically, they want to evaluate four core capabilities:
- Can you estimate the scale properly? Do you understand this is a 100:1 read-to-write ratio, not balanced traffic?
- Do you prevent URL collisions? Can you explain the trade-offs between Base62 encoding and hashing?
- Can you design for 100M+ daily requests? Do you architect for actual scale or theoretical perfection?
- Do you consider cache invalidation? Do you understand that caching strategy matters more than database choice?
Most engineers fail by jumping straight to the database schema without discussing requirements first. They skip back-of-envelope calculations, which makes their scaling claims feel baseless. Many also design around a single database, treating horizontal scaling as an afterthought.
Design Approach
Strong candidates follow a structured approach that demonstrates architectural maturity. Start by clarifying requirements:
- Daily active users
- Custom URL support
- Persistence duration
- Analytics needs
These questions establish constraints that justify every decision afterward. For a service like bit.ly, assume 100 million URLs created monthly with a 100:1 read-to-write ratio.
Next, run back-of-the-envelope calculations to ground your design in reality. With 100 million new URLs monthly, you're handling roughly 40 writes per second. At a 100:1 ratio, that's 4,000 redirects per second.
Storage grows predictably: 100 million URLs at 500 bytes each is roughly 50GB per month, or about 600GB per year. These numbers immediately tell you this is a read-heavy system that needs aggressive caching.
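To make the arithmetic explicit, here's the estimate as a few lines of Python; the inputs are just the assumptions stated above:

```python
# Back-of-envelope capacity estimate for the URL shortener (assumed inputs from above).
URLS_PER_MONTH = 100_000_000
READ_TO_WRITE_RATIO = 100
BYTES_PER_RECORD = 500
SECONDS_PER_MONTH = 30 * 24 * 3600

writes_per_second = URLS_PER_MONTH / SECONDS_PER_MONTH              # ~39 writes/s
reads_per_second = writes_per_second * READ_TO_WRITE_RATIO          # ~3,900 redirects/s
new_storage_gb_per_month = URLS_PER_MONTH * BYTES_PER_RECORD / 1e9  # ~50 GB/month

print(f"{writes_per_second:.0f} writes/s, {reads_per_second:.0f} reads/s, "
      f"{new_storage_gb_per_month:.0f} GB of new storage per month")
```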
Your high-level architecture should reflect these constraints. Place a load balancer in front of stateless application servers that handle both URL creation and redirection. Behind them sits your primary database (PostgreSQL works well here), fronted by Redis for caching hot URLs.
This simple stack handles the scale while remaining operationally straightforward.
ID Generation Strategy
Deep dive into ID generation strategy because interviewers always probe this. Base62 encoding of auto-increment IDs gives you over 56 billion possible six-character URLs (62^6) while remaining human-readable.
Alternatively, hash-based approaches using MD5 provide natural distribution but require collision detection. Defend your choice with specific trade-offs: auto-increment is simpler but reveals creation order; hashing adds complexity but scales horizontally more easily.
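A minimal sketch of the Base62 approach in Python, assuming the ID comes from an auto-increment counter (the function and alphabet names are illustrative):

```python
import string

# 62-character alphabet: 0-9, a-z, A-Z
BASE62_ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode_base62(n: int) -> str:
    """Convert an auto-increment integer ID into a short Base62 string."""
    if n == 0:
        return BASE62_ALPHABET[0]
    chars = []
    while n > 0:
        n, remainder = divmod(n, 62)
        chars.append(BASE62_ALPHABET[remainder])
    return "".join(reversed(chars))

# ID 125 -> "21"; the largest six-character ID, 62**6 - 1, -> "ZZZZZZ"
print(encode_base62(125), encode_base62(62**6 - 1))
```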
Database design centers on a single URLs table with columns for short_url, long_url, created_at, and expiry. Index short_url so redirect lookups stay fast. If you need custom aliases, add a uniqueness constraint and handle collisions gracefully with retry logic.
Caching Strategy
Caching strategy determines user-experienced latency. Implement a cache-aside pattern where the application checks Redis first, then falls back to the database on a cache miss. Give popular URLs generous TTLs and let rarely accessed ones fall out of the cache and be served straight from the database. This approach exploits the 80/20 rule: 80% of traffic hits 20% of URLs.
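Here's a hedged sketch of the cache-aside read path using redis-py; the key prefix, TTL value, and `fetch_from_db` helper are placeholders, not a specific production implementation:

```python
import redis

r = redis.Redis(host="localhost", port=6379)
HOT_URL_TTL_SECONDS = 3600  # illustrative TTL for popular short URLs

def fetch_from_db(short_code: str) -> str | None:
    """Placeholder for the primary-database lookup (e.g. a PostgreSQL query)."""
    ...

def resolve(short_code: str) -> str | None:
    """Cache-aside: check Redis first, fall back to the database on a miss."""
    cached = r.get(f"url:{short_code}")
    if cached is not None:
        return cached.decode()

    long_url = fetch_from_db(short_code)
    if long_url is not None:
        # Populate the cache so subsequent redirects skip the database.
        r.set(f"url:{short_code}", long_url, ex=HOT_URL_TTL_SECONDS)
    return long_url
```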
Interviewers will push on bottlenecks. At 10,000 writes per second, a single database saturates. Propose sharding the database by URL hash to distribute writes evenly. If the cache miss rate climbs above 5%, consider pre-warming the cache with trending URLs identified through analytics.
Common follow-up questions test your depth:
- How do you prevent URL enumeration attacks? Rate limiting by IP address and API key
- How would you implement analytics? Asynchronous event streaming to a separate analytics database
- How do you handle expired URLs? Background workers scanning for TTL expiry with lazy deletion
Address these concerns proactively, and you demonstrate the operational thinking staff engineers need.
2. Design a Rate Limiter
Rate limiter questions evaluate your understanding of distributed systems constraints, algorithm trade-offs, and real-time decision-making under scale. Interviewers want to see whether you can protect shared resources, prevent abuse, and maintain service quality during unexpected traffic spikes.
They evaluate whether you understand core distributed systems concepts:
- Do you know different rate-limiting algorithms? Can you compare the token bucket, leaky bucket, and sliding window, and explain their actual trade-offs?
- Can you design for distributed environments? Do you understand why single-server counters don't work at scale?
- Do you consider edge cases? What happens with clock skew across data centers, or with race conditions in concurrent requests?
- Do you think about failure modes? What happens when your rate limiter itself becomes unavailable?
Most candidates fail by proposing centralized counters that don't scale horizontally, or by selecting algorithms without discussing trade-offs. Others ignore the distributed nature of modern systems, where multiple data centers must enforce limits consistently despite network partitions.
Some forget that rate limiters themselves become critical infrastructure requiring high availability.
Begin by clarifying what you're limiting and why:
- Are limits per-user, per-API endpoint, or per-IP address?
- Do you need hard limits that immediately reject requests, or soft limits that queue overflow?
- What's the time window — per second, minute, or hour?
- Will this run in a single data center or globally?
These questions establish whether you need strong consistency or can tolerate eventual consistency across regions.
Algorithm Selection
Algorithm selection drives your entire architecture, so compare options systematically:
- Token bucket allows brief traffic bursts while maintaining average rate — perfect for API gateways serving bursty mobile clients
- Leaky bucket enforces strict constant outflow, useful for backend services that can't handle spikes
- Sliding window log provides the most accurate tracking but consumes memory linearly with request count
The sliding window counter balances accuracy and memory efficiency, making it popular in production systems. Choose based on your clarified requirements and defend the trade-offs.
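To ground the comparison, here's a minimal single-process token bucket sketch in Python; a production limiter would hold this state in Redis or another shared store rather than in local memory:

```python
import time

class TokenBucket:
    """Allows short bursts up to `capacity` while enforcing an average rate."""

    def __init__(self, rate_per_second: float, capacity: int):
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens proportionally to the time elapsed since the last check.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_second=5, capacity=10)  # 5 req/s average, bursts of 10
print([bucket.allow() for _ in range(12)])            # roughly the first 10 pass, then throttled
```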
System Architecture
For high-level architecture serving millions of requests per second, sketch API gateway instances with rate limiter middleware, backed by a distributed Redis cluster for shared state. Store rate limit rules in a configuration service, allowing dynamic updates without deployment.
Return HTTP 429 with the Retry-After header when limits are exceeded. This design separates policy from enforcement, enabling operators to respond to attacks quickly.
Deep dive into implementation mechanics because details matter at scale. Use Redis atomic operations (INCR with EXPIRE) to safely maintain distributed counters. Handle clock skew across data centers by implementing loose synchronization with acceptable drift windows.
Prevent race conditions by using Lua scripts to execute multiple Redis commands atomically. Address Redis failure scenarios: do you fail open (allowing all traffic) or fail closed (blocking all traffic)? Justify your choice based on system priorities.
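A small sketch of the atomic counter idea using a Lua script through redis-py; the key scheme and default limits are assumptions for illustration:

```python
import redis

r = redis.Redis()

# Lua keeps INCR + EXPIRE atomic, so concurrent requests can't race past the EXPIRE call.
FIXED_WINDOW_LUA = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return current
"""

def allow_request(user_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    """Fixed-window counter: permit at most `limit` requests per user per window."""
    count = r.eval(FIXED_WINDOW_LUA, 1, f"ratelimit:{user_id}", window_seconds)
    return int(count) <= limit
```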
Scaling Considerations
Scale considerations become critical at the staff level. A Redis cluster provides horizontal scaling through key sharding, but hot keys (such as rate limits for celebrity users) can skew load across shards.
Implement local caching to reduce Redis queries by 90%, trading slight accuracy for massive throughput gains. Deploy CDN-level rate limiting as the first line of defense against DDoS attacks, protecting your entire origin infrastructure.
Interviewers can probe edge cases to test depth:
- How do you handle distributed counting across regions when network partitions occur? Accept temporary over-limit allowance using gossip protocols for eventual consistency.
- How do you rate-limit without user authentication? Fingerprint by IP plus User-Agent, acknowledging NAT limitations.
- How do you prevent legitimate traffic from getting blocked? Implement multiple limit tiers with exponential backoff, reserving strictest limits for detected abuse patterns.
The key is showing you understand rate limiting not just as an algorithmic problem, but as a distributed systems challenge that requires careful trade-offs among accuracy, performance, and availability.
3. Design a Distributed Cache Like Redis
Distributed cache questions test your grasp of low-latency data access, consistency models, and availability trade-offs under failure conditions. Interviewers evaluate whether you understand when to prioritize speed over consistency, how replication affects durability, and why cache eviction policies matter operationally.
They're evaluating whether you think systematically about caching:
- Do you understand cache eviction policies? Can you explain when LRU fails and why LFU or TTL-based eviction might be better?
- Can you design for high availability? How do you handle cache node failures without losing all data?
- Do you know caching patterns? Can you compare cache-aside, write-through, and write-behind with real trade-offs?
- Do you consider cache invalidation? Can you reason about one of the hardest problems in computer science?
Common failure patterns reveal shallow thinking. Candidates propose single-node caches with no availability story, or discuss consistency guarantees without acknowledging CAP theorem trade-offs.
Others focus entirely on get/set operations while ignoring memory management under pressure, cache warming strategies, or monitoring approaches that detect degradation before users notice.
Start with requirements that establish your design's constraints:
- What QPS do you need to support — 10,000 or 1 million reads per second?
- How much data must stay in memory — 10GB or 1TB?
- Can you tolerate eventual consistency after failover, or do reads require strong consistency?
- What's an acceptable cache miss rate given downstream database capacity?
These numbers ground every subsequent decision.
Data Structure Design
Core data structures determine performance characteristics. Implement a hash map providing O(1) lookups by key, paired with a doubly-linked list enabling O(1) LRU eviction. This combination delivers the constant-time reads and writes users expect while keeping memory bounded through eviction.
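As a concrete reference, Python's `collections.OrderedDict` is itself a hash map backed by a doubly-linked list, so a compact sketch of the get/set/evict path might look like this:

```python
from collections import OrderedDict

class LRUCache:
    """Hash map + doubly-linked list (via OrderedDict): O(1) get, set, and eviction."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items: OrderedDict[str, str] = OrderedDict()

    def get(self, key: str) -> str | None:
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def set(self, key: str, value: str) -> None:
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.set("a", "1"); cache.set("b", "2"); cache.get("a"); cache.set("c", "3")
print(list(cache.items))  # ['a', 'c'] -- 'b' was least recently used and got evicted
```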
For persistence, add an append-only log capturing writes so recovery after crashes doesn't lose recent updates. Balance memory usage against durability guarantees based on your clarified requirements.
High-level architecture for production deployment distributes data across clustered cache nodes using consistent hashing for key assignment. This minimizes rehashing when nodes join or leave, preserving cache hit rates during scaling events.
Add a replication factor of 3 (one primary, two replicas) to provide fault tolerance without excessive storage overhead. Place a stateless proxy layer in front that routes requests to correct nodes and handles failover transparently, shielding clients from cluster topology changes.
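A minimal consistent-hash ring sketch (the virtual-node count and MD5 hash are illustrative choices) showing how keys map to nodes:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys to cache nodes; adding or removing a node only remaps nearby keys."""

    def __init__(self, nodes: list[str], vnodes: int = 100):
        self.ring: list[tuple[int, str]] = []
        for node in nodes:
            for i in range(vnodes):  # virtual nodes smooth out the distribution
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        h = self._hash(key)
        # First vnode clockwise from the key's hash, wrapping around the ring.
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["cache-1", "cache-2", "cache-3"])
print(ring.node_for("user:42"), ring.node_for("session:abc"))
```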
Caching Patterns
Caching strategies fundamentally change application behavior, so explicitly discuss the trade-offs:
- The cache-aside pattern gives applications complete control over cache population and invalidation, making it well-suited to read-heavy workloads with infrequent updates
- The write-through pattern updates the cache and database atomically, sacrificing write latency for guaranteed consistency
Choose based on whether your system tolerates stale reads or requires strong consistency guarantees.
Likewise, eviction and expiration policies prevent memory exhaustion while keeping the working set cached:
- Least Recently Used (LRU) evicts whichever item has gone longest without access, capturing the temporal locality of most access patterns
- Least Frequently Used (LFU) tracks access frequency instead, making it better suited to workloads with stable hot data. Combine either approach with TTL-based expiration so keys can carry explicit lifetimes.
Implement background scanning to delete expired keys, supplemented by lazy deletion when an expired key is read, so expiry never adds synchronous overhead to the hot path. Monitor the eviction rate as a leading indicator of insufficient memory before performance degrades.
Performance Optimization
Scale considerations expose architectural maturity. Hot keys (data accessed orders of magnitude more than average) create load imbalance across cluster nodes. Replicate hot keys across multiple nodes to distribute read load horizontally. Implement local application-side caching for extremely hot data to eliminate network round-trips.
Address thundering herd during cache misses by implementing request coalescing, where multiple concurrent requests for the same key wait on a single upstream fetch.
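One way to sketch request coalescing in a threaded Python service; the `loader` callback stands in for whatever upstream or database fetch the cache fronts:

```python
import threading

class Coalescer:
    """Only one upstream fetch runs per key; concurrent callers share the leader's result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight: dict[str, threading.Event] = {}
        self._results: dict[str, object] = {}

    def get(self, key: str, loader):
        with self._lock:
            event = self._in_flight.get(key)
            is_leader = event is None
            if is_leader:
                event = threading.Event()
                self._in_flight[key] = event

        if is_leader:
            try:
                self._results[key] = loader(key)   # the single database/origin fetch
            finally:
                event.set()                        # wake every waiting follower
                with self._lock:
                    self._in_flight.pop(key, None)
        else:
            event.wait()                           # followers block until the leader finishes
        return self._results.get(key)
```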
4. Design a News Feed Like X (Twitter) or Facebook
News feed questions test your ability to design read-heavy systems with real-time requirements, personalization at scale, and complex ranking algorithms. Interviewers probe whether you can balance latency against consistency, handle celebrity user edge cases, and reason about storage costs when pre-computing feeds.
They evaluate your architectural thinking across several dimensions:
- Do you understand fan-out patterns? Can you explain when to pre-compute and when to compute on demand?
- Can you balance latency versus consistency? Do you know when eventual consistency is acceptable?
- Do you consider celebrity user edge cases? How do you efficiently handle users with millions of followers?
- Can you design ranking systems? Do you understand feed algorithms beyond simple chronological ordering?
Most engineers stumble by proposing pull-only models that can't meet latency requirements, or push-only models that drive up storage costs, without considering the hybrid approach production systems actually use.
Others ignore ranking complexity, treating feeds as simple chronological lists. Many also forget about privacy filtering, spam detection, or handling deleted posts across pre-computed feeds.
Requirements clarification establishes scale and constraints. For an X (Twitter) scale system, assume 300 million daily active users and an average of 200 followers per user, with users expecting fresh content within seconds.
Ask whether feeds must be strictly chronological or algorithmically ranked. Clarify the read-to-write ratio (typically 100:1) and whether real-time updates require WebSocket connections or whether polling suffices.
Fan-Out Strategy
Fan-out approaches determine your entire architecture, so discuss trade-offs methodically:
- Fan-out on write pre-computes feeds by copying each post to all followers' timelines, delivering speedy reads but expensive writes for users with millions of followers
- Fan-out on read computes feeds on demand by gathering posts from all followed users, minimizing write cost, but potentially slow for large following lists
Production systems use a hybrid approach: pre-compute feeds for regular users, compute on demand for celebrities, and merge results at read time.
System Architecture
High-level architecture reflects this hybrid approach. When users post, the content flows to a fan-out service that asynchronously distributes posts to followers' feed caches stored in Redis sorted sets.
For celebrity posts, skip fan-out entirely and store them in a separate high-fan-out cache. The timeline service merges precomputed and on-demand feeds, ranks them by relevance, and returns the top results.
Store the social graph in a dedicated graph database optimized for follower/following queries. Serve media content through a CDN, keeping origin servers focused on feed logic.
Deep dive into implementation mechanics that distinguish staff-level thinking. Model feed storage using Redis sorted sets where scores represent timestamps or ranking signals, enabling efficient chronological or ranked retrieval. Implement feed trimming, keeping only the most recent 1,000 posts per user to prevent unbounded storage growth.
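A hedged sketch of the sorted-set feed operations with redis-py; the key naming and trim threshold simply mirror the assumptions above:

```python
import time
import redis

r = redis.Redis()
FEED_MAX_POSTS = 1000  # trim each cached feed to the most recent 1,000 entries

def push_to_feed(follower_id: int, post_id: int) -> None:
    """Fan-out on write: add the post to one follower's cached timeline."""
    key = f"feed:{follower_id}"
    r.zadd(key, {str(post_id): time.time()})          # score = timestamp (or ranking signal)
    r.zremrangebyrank(key, 0, -(FEED_MAX_POSTS + 1))  # keep only the newest N posts

def read_feed(user_id: int, count: int = 50) -> list[bytes]:
    """Return the newest posts first for the user's home timeline."""
    return r.zrevrange(f"feed:{user_id}", 0, count - 1)
```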
Handle post deletions through lazy cleanup during reads rather than trying to remove from millions of pre-computed feeds synchronously. Use message queues to buffer fan-out work and provide back-pressure when celebrity posts spike the load.
Feed Ranking
Ranking and personalization layers add complexity worth discussing. Beyond simple chronological order, incorporate engagement signals (likes, replies, click-through rate) to surface relevant content.
Apply ML models predicting user interest based on historical interactions. Implement real-time spam filtering before posts reach feeds. Balance fresh content with popular posts using time-decay functions. Pre-compute ranking for cached feeds, but recalculate dynamically for on-demand portions.
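As an illustration of time-decay ranking, here's a toy scoring function; the weights and half-life are made-up tuning parameters, not a known production formula:

```python
import math
import time

HALF_LIFE_HOURS = 6  # engagement's influence halves every 6 hours (assumed tuning value)

def feed_score(likes: int, replies: int, clicks: int, posted_at: float) -> float:
    """Blend engagement signals with exponential time decay so fresh posts stay competitive."""
    engagement = 1.0 * likes + 2.0 * replies + 0.5 * clicks   # illustrative weights
    age_hours = (time.time() - posted_at) / 3600
    decay = math.exp(-math.log(2) * age_hours / HALF_LIFE_HOURS)
    return engagement * decay

# A two-hour-old post with modest engagement vs. a day-old viral post:
print(feed_score(50, 10, 200, time.time() - 2 * 3600))
print(feed_score(5000, 800, 20000, time.time() - 24 * 3600))
```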
Scale bottlenecks emerge around celebrity users and viral content. For example, when a user has 50 million followers, fan-out work overwhelms queue workers: partition follower lists and process in batches with rate limiting.
For viral posts receiving millions of likes per minute, aggregate engagement metrics asynchronously rather than updating counters in real time. Implement read replicas for the feed cache to distribute query load horizontally. Monitor queue lag as an early warning signal for capacity issues.
5. Design a Video Streaming Service Like YouTube
Video streaming questions evaluate your understanding of content delivery at scale, encoding pipelines, storage economics, and handling massive global bandwidth requirements. Interviewers test whether you can design upload flows, processing pipelines, and playback systems that work reliably across terrible network conditions.
They test multiple dimensions of systems knowledge:
- Do you understand video encoding? Can you explain transcoding, bitrates, and adaptive streaming protocols?
- Can you design for global CDN distribution? Do you understand how CDNs reduce the load on origin servers?
- Do you handle upload processing? Can you design asynchronous pipelines that process hours of video uploads per minute?
- Do you consider storage costs? Do you understand petabyte-scale storage and lifecycle management?
Candidates commonly fail by focusing only on storage without discussing encoding complexity, or by proposing to serve videos directly from origin servers without a CDN architecture.
They don't consider how adaptive bitrate streaming works, ignore the upload bandwidth problem for multi-gigabyte files, or forget about metadata search and recommendation systems that drive discovery.
Start requirements gathering by establishing scale parameters. YouTube receives 500 hours of video uploaded every minute while serving millions of concurrent viewers. Storage scales to petabyte levels quickly — a single 4K video can consume 20GB after encoding across multiple quality tiers.
Ask about supported video qualities, upload size limits, acceptable buffering rates, and whether live streaming is required. These numbers inform every architectural choice.
Upload Processing
The upload pipeline handles the most technically complex flow. Implement a resumable upload protocol so users can retry failed uploads from the last checkpoint rather than restarting from scratch. Use multipart uploads to split large files into manageable chunks uploaded in parallel.
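A rough sketch of the chunking side of a resumable upload; `send_chunk` and `already_uploaded` are placeholders for whatever transport and checkpoint store the service exposes:

```python
import hashlib

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB chunks (assumed size)

def chunk_file(path: str):
    """Yield (index, checksum, bytes) chunks so a failed upload can resume mid-file."""
    with open(path, "rb") as f:
        index = 0
        while chunk := f.read(CHUNK_SIZE):
            yield index, hashlib.sha256(chunk).hexdigest(), chunk
            index += 1

def upload_resumable(path: str, already_uploaded: set[int], send_chunk) -> None:
    """Skip chunks the server already acknowledged; re-send only what's missing."""
    for index, checksum, data in chunk_file(path):
        if index in already_uploaded:
            continue  # checkpointed on the server, no need to retransmit
        send_chunk(index, checksum, data)  # hypothetical transport call (e.g. an HTTP PUT)
```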
Once the upload completes, trigger the video processing queue that fans out to the encoding worker farm.
Each worker transcodes the original file into multiple quality levels (360p, 720p, 1080p, 4K), plus generates thumbnails and extracts metadata. Store all artifacts in object storage, such as Amazon S3, which provides durability at scale.
Content Delivery
Streaming architecture delivers content globally with minimal latency through CDN distribution. Store encoded videos as segmented HLS or DASH streams whose manifests point to 2- to 10-second chunks. When users request a video, edge servers serve manifests and chunks from cache, falling back to origin servers on a cache miss.
Implement adaptive bitrate streaming where the client player automatically switches quality levels based on measured network conditions, preventing buffering. Pre-warm CDN caches for newly published videos from popular creators, ensuring first viewers get cache hits.
Discovery Systems
Metadata and discovery systems enable users to find content among billions of videos. Store video metadata (title, description, upload date, view count) in a sharded relational database partitioned by creator ID.
Maintain an Elasticsearch search index asynchronously via change data capture, enabling sub-second search latency. Build a recommendation engine using collaborative filtering and content-based approaches, precomputing suggestions and caching them.
Track view count and engagement metrics through an event streaming pipeline aggregating statistics in a real-time data warehouse.
Scale considerations dominate operational discussions. Video encoding compute represents a massive ongoing cost — optimize by detecting duplicate uploads via perceptual hashing before transcoding. Storage costs grow unboundedly — implement lifecycle policies moving older, less-watched content to cheaper cold storage tiers.
Bandwidth costs at CDN scale become significant: negotiate committed-use discounts and implement caching policies that prioritize popular content. Monitor for cache hit rates above 95%, encoding queue lag under 5 minutes, and playback start times under 2 seconds.
Interviewers can then probe failure scenarios and edge cases:
- How do you handle corrupted uploads? Validate file headers and run integrity checks before queuing for encoding.
- What happens when encoding fails? Implement retry with exponential backoff and dead-letter queues for manual investigation.
- How do you prevent copyright violations? Implement a content ID system that matches uploaded videos against a reference database.
- How do you handle sudden viral spikes? CDN absorption, combined with origin rate limiting, protects the infrastructure while serving cached content.
These failure scenarios test whether you think beyond happy path designs to real-world operational challenges. Strong candidates proactively address edge cases, demonstrating the production mindset that distinguishes staff-level engineers from those still learning system design fundamentals.
6. Design a Messaging System Like WhatsApp or Slack
Messaging system questions test your understanding of real-time communication, WebSocket connection management, message delivery guarantees, and scaling persistent connections across millions of users. Interviewers evaluate whether you can design for low latency, handle offline users gracefully, and ensure messages arrive exactly once in the correct order.
They evaluate your understanding of real-time systems:
- Do you understand WebSocket versus polling? Can you explain why persistent connections matter for real-time messaging?
- Can you design message persistence? How do you store billions of messages with fast retrieval?
- Do you handle offline users? What happens to messages when recipients aren't connected?
- Can you implement read receipts? How do you track message state across a distributed system?
Common failures reveal gaps in distributed systems thinking. Candidates propose HTTP polling instead of persistent connections, which wastes bandwidth and adds latency. They don't discuss message ordering guarantees or explain how offline message delivery works.
Many also ignore the complexity of group chat scaling — naive fan-out to thousands of members creates hotspots. Others forget about read receipts, typing indicators, and presence information that users expect.
Requirements clarification establishes your system's scope:
- Confirm support for one-to-one and group conversations, expected message delivery latency, offline message retention period, and whether media sharing is required
- Ask about read receipts, typing indicators, and end-to-end encryption requirements. For WhatsApp scale, assume roughly 3 billion users and tens of billions of messages daily, implying on the order of 1 million messages per second at peak
This helps you make core architectural decisions.
Connection Architecture
Core architecture centers on persistent connections for real-time delivery. Deploy WebSocket gateway servers maintaining long-lived connections to online clients, enabling sub-100ms message delivery.
Behind gateways sits a message-routing service that determines the destination users and invokes delivery logic. Use a message queue like Kafka for reliable delivery — messages persist in the queue until the recipient acknowledges them. Store message history in a NoSQL database, such as Cassandra, partitioned by conversation ID for efficient retrieval.
Message Delivery
Message flow demonstrates your systems thinking. When the sender transmits a message, it reaches the WebSocket gateway, which forwards it to the message service. The message service writes to a Kafka topic, persists to Cassandra, and attempts immediate delivery if the recipient is online.
Delivery workers consume from Kafka and push messages to the recipient's WebSocket connection. If the recipient is offline, the message waits in Kafka with delivery retry logic. The client acknowledges receipt, at which point the delivery worker commits the offset and the message counts as delivered.
This flow provides at-least-once delivery semantics; client-side deduplication makes it effectively exactly-once from the user's perspective.
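A minimal sketch of that client-side deduplication (the message-ID scheme and acknowledgment call are illustrative):

```python
class DedupingReceiver:
    """Client-side deduplication: at-least-once delivery plus an idempotent apply step."""

    def __init__(self):
        self.seen_message_ids: set[str] = set()  # in practice, bounded per conversation

    def on_message(self, message_id: str, payload: str) -> None:
        if message_id in self.seen_message_ids:
            return                 # duplicate redelivery from the queue, ignore it
        self.seen_message_ids.add(message_id)
        self.render(payload)       # applied exactly once from the user's perspective
        self.acknowledge(message_id)

    def render(self, payload: str) -> None:
        print("new message:", payload)

    def acknowledge(self, message_id: str) -> None:
        # In the real system this would send an ACK back to the delivery worker.
        pass
```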
Group chat introduces scaling complexity worth discussing. For small groups of fewer than 100 members, fan out messages to all member connections synchronously. For large groups like company-wide channels, don't fan out — instead, publish once to the group's Kafka topic and have clients subscribe.
Track per-user read cursors to indicate the last-seen message, enabling catch-up when users come online. Implement pagination for historical message retrieval to prevent large groups from overwhelming clients during join.
Additional features demonstrate attention to product details that interviewers value. Typing indicators use ephemeral events stored in Redis with a 5-second TTL — no need to persist these transient signals. Read receipts update message metadata in Cassandra when clients confirm reading.
Presence information tracks online status through heartbeat protocol, timing out connections after 30 seconds of inactivity. Media sharing uploads files to object storage and inserts the URL into the message payload rather than inline binary data.
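A small sketch of the ephemeral presence and typing signals with redis-py, using the TTLs mentioned above; the key names are illustrative:

```python
import redis

r = redis.Redis()

def heartbeat(user_id: str) -> None:
    """Refresh presence on every client heartbeat; the key expires after 30s of silence."""
    r.set(f"presence:{user_id}", "online", ex=30)

def set_typing(user_id: str, conversation_id: str) -> None:
    """Typing indicators are transient: a 5-second TTL, never persisted."""
    r.set(f"typing:{conversation_id}:{user_id}", "1", ex=5)

def is_online(user_id: str) -> bool:
    return r.exists(f"presence:{user_id}") == 1
```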
Scaling Challenges
Scale considerations reveal operational maturity. WebSocket connections consume server memory — each connection requires 10KB, so 1 million connections need 10GB RAM per gateway node.
Implement connection draining during deployments to enable graceful failover. Message queues accumulate while recipients are offline, so monitor lag and scale delivery workers dynamically. Cap backfill so users who have been offline for weeks don't receive thousands of queued messages all at once.
Shard Cassandra by conversation hash, evenly balancing storage across cluster nodes.
How DataAnnotation Builds Technical Interview Readiness
In technical interviews, the challenge isn't knowing the patterns; you apply them at work constantly. The problem is that during extended interview cycles, you're not regularly exercising the rapid evaluation and trade-off analysis these interviews demand.
What keeps architectural thinking sharp is regularly evaluating technical decisions under realistic constraints. When you review AI-generated code for platforms like DataAnnotation that pay $40+ per hour, you diagnose problems, choose fixes, and justify decisions clearly.
You're constantly making technical judgments about Python, JavaScript, and other languages while getting paid. Every evaluation mirrors interview pressure: assess complex situations quickly, explain your reasoning concisely, and communicate decisions knowing they'll be scrutinized.
The platform has paid over $20 million to remote workers since 2020, maintaining 3.7/5 stars on Indeed with 700+ reviews and 3.9/5 stars on Glassdoor with 300+ reviews.
You understand distributed systems from work experience. What you need is active practice evaluating architectural decisions under time pressure — exactly what these interviews test.
Stay Sharp for Technical Interviews With DataAnnotation
You have the engineering experience. What you're missing is practice articulating complex situations clearly under pressure while someone evaluates your reasoning. Code evaluation work solves this challenge.
DataAnnotation's coding projects at $40+ per hour develop the rapid, clear communication these interviews demand. After hundreds of evaluations, delivering crisp, well-structured answers becomes natural because you've practiced that exact skill repeatedly.
Getting from interested to earning takes five straightforward steps:
- Visit the DataAnnotation application page and click “Apply”
- Fill out the brief form with your background and availability
- Complete the Starter Assessment
- Check your inbox for the approval decision (which should arrive within a few days)
- Log in to your dashboard, choose your first project, and start earning
No signup fees. DataAnnotation stays selective to maintain quality standards. You can only take the Starter Assessment once, so read the instructions carefully and review before submitting.
Start your application at DataAnnotation today and keep your technical evaluation skills sharp during interview cycles.