Digonto

The Incident That Taught Me Data Pipeline Segregation

August 10, 2025 · 10 min read · Production Incidents

It was an ordinary Sunday afternoon, everything was calm, when suddenly I heard the familiar Slack notification. A new live-support incident.

Slack notification showing the production incident alert

I saw the message and immediately knew it was trouble. We were hosting a 5,000-participant webinar, and I could already guess it was going far from smoothly.

The Alert Storm

Redis monitoring showing connection metrics during the incident
Redis monitoring showing latency spikes during the incident

Our Redis metrics were showing unusual patterns: CPU usage and latency were spiking sharply, a clear sign that something was wrong with our system.

Backend latency metrics showing increased response times
Gateway timeout errors during the incident

Backend latency was climbing steeply while gateway timeouts started flooding our error logs; parts of our system were falling over.

The Investigation Begins

We are a WebRTC / video conferencing platform, and the sessions microservice is the source of truth for realtime meetings / sessions and their analytics data.

Sessions is a critical microservice, relied upon by the recording microservice, AI-related workers like live transcription, and more. A failure here means critical business impact. As the maintainer of the sessions microservice, I felt the gravity of it and quickly got to work. A teammate joined me on the effort.

I looked around Slack and found recent threads started by the SRE team regarding heavy lock usage on AWS RDS. The on-call engineer was getting paged about the service degradation.

AWS RDS CPU usage graph showing high utilization

So the initial presumption was that PostgreSQL had choked and all queries were hanging. I tried to reproduce it with a local script calling preprod nodes on Kubernetes, and everything went fine. I bombarded the pods with millions of queries; they ran slowly, but every one of them succeeded.

The Breakthrough

We tried one more time, stress testing our preprod nodes while also invoking a few queries by hand against the same nodes connected to the same DB. Bang, now the queries got stuck.

We started eliminating possibilities systematically, testing each one in isolation:

Pod Health Check: Pods were responding normally, no hangs, <5% CPU
Database Status: PostgreSQL was under load but still operational
Sync Queries: Queries executed successfully when run in each batch
Database Queries: Direct database queries through pgAdmin resolved fine

We were like, wtf was wrong then?

The Root Cause: ORM Pool Saturation

A realization dawned on us: the DB was not the problem, the data pipeline was. Specifically, the ORM connection pool had saturated, and the important createSession() & getActiveSession() queries were stuck waiting behind numerous DB update queries.

To understand why this is an issue, you need to know that sessions broadly processes three types of events:

1. Primary Events: Consumes hundreds of PeerJoin / PeerLeave events from socket services, and aggregates usage, tracks users, reports analytics, etc.
2. Critical Events: Creates sessions, serves a few but critical session reads to other microservices, reports whether the session is live or ended, etc.
3. Analytics Reads: Serves website graph queries, and billing- and analytics-related queries, from RDS reader instances.

Primary events are written to the RDS writer instance, and Critical events also need to be served from the writer to avoid potential replication lag under load. This means both the Primary and Critical events were competing for the same limited connection pool to RDS writer.

But because the Primary events were high volume, they saturated the pool, forcing Critical events to wait for connections.

At peak times, the pool was constantly saturated by ~40 concurrent events from RabbitMQ, each handler querying the DB in parallel with up to 2-3 queries, which put the number of waiting queries per pool connection at (40 ÷ 10) × 3 = 12.

# The Math:
RabbitMQ prefetch: 40 messages
ORM pool size: 10 connections
RabbitMQ events per pool connection: 40 ÷ 10 = 4
Parallel queries per event: 3
Waiting queries per connection: 4 × 3 = 12
Query time under load: ~600ms
Wait time for Critical Events = 12 queries × 600ms = 7.2 seconds! 😱

So every Critical event had to wait roughly 7 seconds before it could even get its hands on the DB. This caused all session creation and other critical operations to fail. 😢
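If you want to play with these numbers yourself, the same back-of-envelope math fits in a few lines of TypeScript. This is purely illustrative; the constants are the incident values from above, nothing here is production code:

```typescript
// Back-of-envelope model of the queueing described above.
// Inputs are the incident numbers: prefetch 40, pool size 10,
// ~3 parallel queries per event, ~600ms per query under load.
function estimateCriticalWaitMs(
  prefetch: number,        // RabbitMQ messages in flight per pod
  poolSize: number,        // ORM connection pool size
  queriesPerEvent: number, // parallel DB queries each handler issues
  queryTimeMs: number      // observed query latency under load
): number {
  const eventsPerConnection = prefetch / poolSize;                           // 40 / 10 = 4
  const waitingQueriesPerConnection = eventsPerConnection * queriesPerEvent; // 4 * 3 = 12
  return waitingQueriesPerConnection * queryTimeMs;                          // 12 * 600 = 7200
}

console.log(estimateCriticalWaitMs(40, 10, 3, 600)); // ~7200ms before a Critical query gets a connection
```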

Old Architecture: Single Connection Pool

Architecture before changes - single connection pool causing bottleneck

The Solution: Segregated Connection Pools

We realized that our important queries needed a separate ORM pool, one that gives them congestion-free, prioritized access to the database, especially when the DB itself is under load and query latency is higher.

Architecture after changes - segregated pools for different query priorities

We shifted from one global ORM instance to two specialized instances, each with its own connection pool.

Code Glimpse: Dual ORM Setup

Code showing the implementation of two separate ORM instances with specialized connection pools
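The screenshot isn't reproduced here, but conceptually the setup looks something like the sketch below. The specific ORM (TypeORM) and the names (primaryDataSource, criticalDataSource, RDS_WRITER_URL) are assumptions for illustration; the real service has its own entities and configuration:

```typescript
import { DataSource } from "typeorm";

// Sketch only: TypeORM is assumed here. `extra` is forwarded to the
// node-postgres driver, so `extra.max` sizes the underlying connection pool.

// Pool for high-volume Primary events (PeerJoin / PeerLeave aggregation, analytics writes).
export const primaryDataSource = new DataSource({
  type: "postgres",
  url: process.env.RDS_WRITER_URL, // hypothetical env var: the RDS writer instance
  entities: [/* session entities */],
  extra: { max: 30 }, // bulk work is allowed to queue here
});

// Small, dedicated pool reserved for Critical events
// (createSession / getActiveSession and other hot-path queries).
export const criticalDataSource = new DataSource({
  type: "postgres",
  url: process.env.RDS_WRITER_URL, // same writer, but its own connections
  entities: [/* session entities */],
  extra: { max: 10 }, // never starved by Primary traffic
});

export async function initDataSources(): Promise<void> {
  await Promise.all([
    primaryDataSource.initialize(),
    criticalDataSource.initialize(),
  ]);
}
```

The point is simply that the two workloads can no longer steal connections from each other: queries issued through criticalDataSource always have headroom, even when Primary traffic saturates its own pool.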

Immediate Results

We instantly saw the critical query read time dropping from >5 seconds to <50ms!

Aggressive Performance Tuning

We didn't stop there; we aggressively tuned sessions-ms that same day:

1. Made sure our ORM-generated queries were using the proper indexes, and fixed 2 queries joining 20M+ rows without an index, instantly making overall queries 5x faster.

2. Added a multi-column index on a hot path where Postgres refused to use the single-column indexes, gaining another 2x in performance.

3. Moved most of the frequent update operations to Redis, drastically reducing DB locks (see the sketch after this list).

4. Increased the ORM pool size from 10 to 40 per pod; we arrived at this value by stress testing to find the concurrency at which we hit ~50% of our CPU limits.
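For point 3, here is a hedged sketch of what moving hot update paths to Redis can look like. The pattern shown (counters aggregated in a Redis hash and flushed to Postgres periodically) is one common approach, not necessarily the exact one we used; key names and the flush job are hypothetical:

```typescript
import Redis from "ioredis";

// Aggregate hot counters in a Redis hash and flush them to Postgres
// periodically, so frequent UPDATEs stop hammering (and locking) the same rows.
const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// Hot path: bump a per-session usage counter in Redis instead of updating a row.
export async function recordPeerMinutes(sessionId: string, minutes: number): Promise<void> {
  await redis.hincrby(`session:${sessionId}:usage`, "peerMinutes", minutes);
}

// Cold path: a periodic job drains the counter and issues one aggregated DB write.
// (Simplified: a real job would decrement what it flushed so concurrent increments aren't lost.)
export async function flushUsage(
  sessionId: string,
  writeToDb: (peerMinutes: number) => Promise<void>
): Promise<void> {
  const value = await redis.hget(`session:${sessionId}:usage`, "peerMinutes");
  if (value) {
    await writeToDb(Number(value));
    await redis.hdel(`session:${sessionId}:usage`, "peerMinutes");
  }
}
```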

Key Takeaways

1. Not All Queries Are Equal

Critical internal queries should never wait behind batch processing or analytics workloads. Segregate your connection pools based on query priority and latency requirements.

2. The Database Might Not Be the Bottleneck

Sometimes the issue isn't database performance but how your application accesses it. Connection pool exhaustion can make a healthy database appear overloaded.

3. Load Testing ≠ Production Reality

Our load tests didn't catch this because they focused on throughput, not the interaction between different query types competing for the same resources.

4. Monitor Pool Metrics

Connection pool wait times, active connections, and queue depth are critical metrics that should be monitored and alerted on.
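As a concrete example of what that can look like, here is a minimal sketch using node-postgres directly (most Node ORMs sit on top of a pg Pool; whether yours exposes it is worth checking). How you export the numbers (Prometheus, CloudWatch, structured logs) is up to you:

```typescript
import { Pool } from "pg";

const pool = new Pool({
  connectionString: process.env.RDS_WRITER_URL, // hypothetical env var
  max: 40,                                      // the post-incident pool size
  connectionTimeoutMillis: 5_000,               // fail fast instead of queueing forever
});

// node-postgres exposes the three numbers you care about as pool properties.
setInterval(() => {
  console.log({
    total: pool.totalCount,     // connections currently open
    idle: pool.idleCount,       // open but unused
    waiting: pool.waitingCount, // callers queued for a connection -- alert on this
  });
}, 10_000);
```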

One Year Later

It's been over a year now, and we've hardly faced any sessions-ms performance degradation, even though our platform usage has grown ~100% almost every quarter. The incident that initially seemed like a disaster became a valuable learning experience that made our system more resilient.

100% Quarterly Growth · <1ms P99 Latency · Zero Pool Timeouts