Your user clicks “Send Email”. Your API freezes for 3 seconds while it talks to the SMTP server. Then it finally responds “OK.”
That 3-second wait is a crime.
Every operation that doesn’t need to happen right now should happen later in the background. This is the entire philosophy behind task queues and message brokers.
This is the Mastery Guide to async processing, from the simplest Django task queue to planet-scale Kafka.
Part 1: Foundations (The Mental Model)
The core idea is simple: Decouple the “Request” from the “Work”.
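The decoupling fits in a few lines of standard-library Python. This toy sketch does what every task queue does: the request handler drops a job into a queue and answers immediately, while a worker thread does the slow part later (the SMTP conversation is simulated with a short sleep):

```python
import queue
import threading
import time

jobs: queue.Queue = queue.Queue()  # the "mailbox"

def worker() -> None:
    # The background worker: pull jobs and do the slow work later.
    while True:
        to_addr = jobs.get()
        time.sleep(0.01)  # stand-in for the slow SMTP conversation
        print(f"email sent to {to_addr}")
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_request(to_addr: str) -> str:
    # The "request": enqueue and answer immediately -- no 3-second freeze.
    jobs.put(to_addr)
    return "OK"

print(handle_request("user@example.com"))  # responds instantly
jobs.join()  # demo only: wait for the worker before the script exits
```

Celery, RabbitMQ, and Kafka are, at heart, industrial-strength versions of that queue: durable, networked, and shared between machines.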
The “Post Office” Hierarchy
Think of it like three levels of a mail system:
Task Queue (Celery / Django-Q) = The Mailbox on your doorstep. Simple. Local. You dump letters in, the postman picks them up. You don’t think about where they go.

Message Broker (RabbitMQ) = The Post Office Branch. Smart routing. You drop off a package labeled “URGENT” and another labeled “Standard”. The clerk at the counter knows to send Urgent to the overnight van and Standard to the weekly truck. Multiple senders, multiple receivers.

Event Streaming (Kafka) = The National Newspaper. Everyone publishes articles (events). Everyone subscribes and reads at their own pace. Old articles are kept in the archive. You can re-read (replay) last Tuesday’s paper anytime you want.
Part 2: The Stack (Who Does What)
Understanding layers is critical. They solve different problems.
Layer 1: The Task Queue (Celery / Django-Q)
This is where you start. You have a Django app, and you need things to run in the background.
What it is: A Python library that runs functions asynchronously in a separate process (worker).
What it needs: A Broker to store the task messages (typically Redis or RabbitMQ).
Use Celery when: Sending emails, resizing images, generating PDFs, running nightly imports.
Layer 2: The Message Broker (RabbitMQ)
This is the “middleware”. It is a dedicated server that receives, stores, and routes messages.
Key Concepts:
- Producer: Sends messages into the broker.
- Queue: A named buffer that holds messages.
- Consumer: Pulls messages from the queue to process them.
- Exchange: The “router”. Based on a routing key, it decides which queue a message goes to.
Use RabbitMQ when: You have multiple services that need to talk to each other asynchronously. “When an Order is placed, notify the Inventory Service AND the Notification Service.”
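What the exchange does can be modeled in plain Python. This is a toy in-process sketch of a direct exchange, not real RabbitMQ code: a message published with a routing key is copied into every queue bound to that key (all queue and key names are invented):

```python
from collections import defaultdict

class DirectExchange:
    """Toy model of a RabbitMQ direct exchange: routing key -> bound queues."""
    def __init__(self) -> None:
        self.bindings: dict[str, list[list]] = defaultdict(list)

    def bind(self, routing_key: str, q: list) -> None:
        self.bindings[routing_key].append(q)

    def publish(self, routing_key: str, message: str) -> None:
        for q in self.bindings[routing_key]:
            q.append(message)  # each bound queue gets its own copy

inventory_q: list = []   # Inventory Service's queue
notify_q: list = []      # Notification Service's queue

ex = DirectExchange()
ex.bind("order.placed", inventory_q)
ex.bind("order.placed", notify_q)

ex.publish("order.placed", "order #123")
print(inventory_q, notify_q)  # both services received the event
```

Real RabbitMQ adds durability, acknowledgements, and fancier exchange types (topic, fanout, headers), but the routing idea is exactly this.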
Layer 3: The Event Streaming Platform (Kafka)
Kafka is a fundamentally different beast. It is not a queue — it is a distributed, append-only log.
Key Concepts:
- Topic: Like a table in a database, or a news channel.
- Partition: A topic is split into partitions for parallelism. (Like multiple lanes on a highway).
- Offset: A sequential number for each message. Consumer X is at offset 1500, Consumer Y at offset 2300. Reading does not delete messages; they are kept until a configurable retention period expires.
- Consumer Group: Multiple consumers working together to process a topic. Kafka divides partitions among them.
Use Kafka when: You need a permanent, replayable audit log of everything that happened. “Every payment event, forever. Any service can subscribe now or catch up from 3 months ago.”
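The log-and-offset model is easy to simulate. In this sketch (all names illustrative), two consumers read the same append-only list at their own pace, and “replay” is nothing more than resetting an offset:

```python
class TopicPartition:
    """Toy append-only log: every message keeps its position (offset) forever."""
    def __init__(self) -> None:
        self.log: list[str] = []

    def append(self, msg: str) -> int:
        self.log.append(msg)
        return len(self.log) - 1  # the new message's offset

    def read_from(self, offset: int) -> list[str]:
        return self.log[offset:]  # reading never deletes anything

payments = TopicPartition()
for event in ["pay#1", "pay#2", "pay#3"]:
    payments.append(event)

fast_consumer_offset = 3   # fully caught up
slow_consumer_offset = 1   # lagging by 2 messages

print(payments.read_from(slow_consumer_offset))  # ['pay#2', 'pay#3']
print(payments.read_from(0))                     # full replay, anytime
```

A queue forgets a message once it is consumed; a log remembers, and each consumer just remembers where it left off.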
Part 3: The Investigation (Debug Like a Pro)
1. Monitor Your Celery Workers
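The usual tools are Celery’s built-in inspect commands and the Flower web dashboard (the -A proj app name is a placeholder for your own project):

```shell
# What is each worker running right now?
celery -A proj inspect active

# Tasks fetched from the broker but not yet started
celery -A proj inspect reserved

# Worker counters: pool size, totals processed, etc.
celery -A proj inspect stats

# Web dashboard (separate package: pip install flower)
celery -A proj flower
```

These commands need a running broker and workers to report on; run them against your live stack, not on a cold machine.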
2. The Dead Letter Queue (DLQ)
The most important pattern you must implement. If a task fails permanently (retried 3 times, still failed), where does it go?
- Without DLQ: The message is silently dropped. 🔥 Data loss.
- With DLQ: The failed message is moved to a special “Dead Letter” queue. You can inspect it, fix the bug, and re-queue the messages.
In RabbitMQ: Configured via x-dead-letter-exchange.
In Celery: Use the on_failure hook or task-level dead letter handling.
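The pattern itself is tiny. Here is a broker-independent toy version (function and message names are invented): retry a handler a fixed number of times, and park the message in a dead-letter queue instead of dropping it:

```python
import queue

MAX_RETRIES = 3
dead_letters: "queue.Queue[tuple[str, str]]" = queue.Queue()  # the DLQ

def process_with_dlq(message: str, handler) -> None:
    last_error = None
    for attempt in range(MAX_RETRIES):
        try:
            handler(message)
            return  # success
        except Exception as exc:
            last_error = exc
    # Retried MAX_RETRIES times, still failing: park it, don't drop it.
    dead_letters.put((message, str(last_error)))

def flaky_handler(message: str) -> None:
    raise RuntimeError("SMTP server unreachable")

process_with_dlq("welcome-email:42", flaky_handler)
print(dead_letters.qsize())  # 1 -- inspect it, fix the bug, re-queue later
```

In production the DLQ is just another queue in your broker, so the same consumers and tooling can drain it once the bug is fixed.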
3. Kafka Consumer Lag
The most important metric in Kafka. Consumer Lag = How far behind a consumer is from the latest message.
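The standard way to check it is the CLI tool that ships with Kafka (the group name below is a placeholder). Lag per partition is simply LOG-END-OFFSET minus CURRENT-OFFSET:

```shell
# Per-partition offsets and lag for one consumer group
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group billing
```

The output lists TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, and LAG for every partition the group owns.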
If lag is growing, your consumers are not fast enough. You need more consumer instances (up to the number of partitions).
Part 4: The Diagnosis (Common Failures)
| Symptom | Cause | Fix |
|---|---|---|
| Queue keeps growing | Workers are too slow | Add more Celery workers. Optimize the task. |
| Same task runs twice | No idempotency. Worker crashed mid-task, broker retried. | Make your task idempotent: “If order #123 is already processed, skip.” |
| Can’t choose: Django-Q vs Celery | Django-Q needs no separate broker (it can use your database). Celery is more powerful but needs Redis/RabbitMQ. | Use Django-Q for simple projects. Use Celery for production workloads. |
| Kafka message loss | acks=1 means only the leader acknowledged. Leader crashes before replication. | Set acks=all on the Producer. |
| Kafka cannot scale | The topic has only 1 partition; a consumer group can’t have more active consumers than partitions. | Create the topic with enough partitions, or add more later. (The count can never be reduced!) |
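The idempotency fix from the table above can be sketched with an in-memory “processed” set (in production this check usually lives in the database, e.g. behind a unique constraint):

```python
processed: set[int] = set()   # in production: a DB table / unique constraint
charges: list[int] = []       # the side effect we must not repeat

def process_order(order_id: int) -> None:
    if order_id in processed:
        return  # already handled -- safe to receive this message twice
    charges.append(order_id)  # charge the customer
    processed.add(order_id)

process_order(123)
process_order(123)  # broker redelivered after a worker crash
print(charges)      # [123] -- charged exactly once
```

With this guard in place, “at-least-once” delivery from the broker becomes “effectively-once” processing in your system.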
Part 5: The Resolution (Python Cookbook)
1. Celery with Django (The Standard Stack)
2. Django-Q (Zero Infrastructure)
When you can’t be bothered to set up Redis/RabbitMQ for a small project.
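A minimal configuration sketch. Django-Q’s ORM broker means the queue lives in your existing database, so there is nothing extra to deploy (cluster name and dotted task path below are placeholders):

```python
# settings.py -- Django-Q using the Django ORM as its broker
Q_CLUSTER = {
    "name": "proj",       # cluster name (placeholder)
    "workers": 2,         # background worker processes
    "retry": 60,          # seconds before an unacknowledged task is retried
    "orm": "default",     # use the "default" database as the broker
}

# Enqueueing, anywhere in your code:
#   from django_q.tasks import async_task
#   async_task("app.services.send_welcome_email", user_id=42)
#
# Run the workers with:
#   python manage.py qcluster
```

Tasks are referenced by dotted path rather than imported, which keeps the web process free of any worker-side dependencies.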
3. Kafka with Python (The Event Log)
Final Mental Model
Decision Guide:
- Solo Django app, background tasks? → Django-Q (no extra infra).
- Production Django app? → Celery + Redis.
- Multiple microservices talking to each other? → RabbitMQ.
- Event sourcing, audit log, cross-team data pipeline? → Kafka.
