
Observability: Logs vs. Metrics vs. Tracing — The 'Doctor's Kit' Mental Model

Your app is slow. Is it the DB? The queue? The network? A mastery guide to the three pillars of observability and how they work together.

Your production app is slow. A user complains. Your manager pings you on Slack.

You SSH into the server and… stare at it.

Without observability, debugging production is like a doctor diagnosing a patient they cannot see, touch, or talk to. This guide gives you the doctor's instruments.


Part 1: Foundations (The Mental Model)

The Doctor’s Kit

A doctor has three instruments for different questions:

  • Stethoscope (Logs): “Let me LISTEN to what the patient is saying moment to moment.” Raw events. Timestamped. Detailed. Noisy.
  • Vital Signs Dashboard (Metrics): “Let me CHECK key numbers: heart rate, blood pressure, temperature.” Aggregated. Numeric. Great for dashboards.
  • X-Ray (Tracing): “Let me SEE inside and follow this exact problem through every organ.” End-to-end request flow across services.
| Pillar | Question Answered | Unit | Tools |
| --- | --- | --- | --- |
| Logs | "What exactly happened?" | Text events | Datadog, Loki, ELK Stack |
| Metrics | "How is the system in general?" | Numbers over time | Prometheus, Grafana, Datadog |
| Tracing | "Where did this request slow down?" | Spans + Traces | Jaeger, Zipkin, OpenTelemetry |

Part 2: The Investigation (Each Pillar Deep Dive)

1. Logs — “The Patient’s Diary”

Logs are timestamped records of what happened. The key discipline: structured logging.

```python
# ❌ BAD: Unstructured (hard to query)
print(f"Error processing order for user {user_id}")

# ✅ GOOD: Structured JSON (searchable, filterable)
import structlog
log = structlog.get_logger()

log.error(
    "order_processing_failed",
    user_id=user_id,
    order_id=order_id,
    reason=str(e),
    duration_ms=123,
)
# Output: {"event": "order_processing_failed", "user_id": 42, "order_id": 999, ...}
```
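Structured logs pay off at query time: because every line is a JSON object, your log backend can filter on fields instead of grepping text. A minimal sketch of that idea using only the standard library (the log lines and field names here are made up for illustration):

```python
# Sketch: why structured logs are queryable. The log lines and field
# names are hypothetical, but the pattern works for any JSON-per-line log.
import json

raw_lines = [
    '{"event": "order_processing_failed", "user_id": 42, "order_id": 999}',
    '{"event": "user_login", "user_id": 7}',
    '{"event": "order_processing_failed", "user_id": 42, "order_id": 1001}',
]

# Filter the way a log backend would: parse each line, then match on fields
failures = [
    rec for rec in map(json.loads, raw_lines)
    if rec["event"] == "order_processing_failed" and rec["user_id"] == 42
]
print([rec["order_id"] for rec in failures])  # → [999, 1001]
```

With unstructured `print`-style logs, the same query would be a brittle regex.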

Log Levels (Use them correctly!):

  • DEBUG: Verbose detail for developers. Off in production.
  • INFO: Normal operations. “User 42 logged in.”
  • WARNING: Something bad might happen. “Disk is 80% full.”
  • ERROR: Something failed. “Payment failed for order 999.”
  • CRITICAL: The app is dying. On-call engineer wakes up.
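The levels above form a filter: setting a logger to INFO in production drops everything below it. A quick sketch with Python's standard `logging` module (logger name and messages are illustrative):

```python
# Minimal sketch of level filtering with the stdlib logging module.
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("demo")
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.INFO)  # Production-style: DEBUG is filtered out

logger.debug("cache hit for key=abc")  # Dropped: below INFO
logger.info("User 42 logged in")       # Emitted
logger.warning("Disk is 80% full")     # Emitted

print(stream.getvalue())
```

Only the INFO and WARNING lines reach the handler; the DEBUG line never leaves the process.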

2. Metrics — “The Vital Signs”

Metrics tell you how things are trending over time. The four Golden Signals (Google SRE):

  • Latency: How long do requests take? (p50, p95, p99)
  • Traffic: How many requests per second?
  • Errors: What % of requests are failing?
  • Saturation: How full is your system? (CPU %, Memory %, Queue depth)
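What do p50/p95/p99 actually mean? They are percentile cut points over latency samples: p99 = the value 99% of requests stay under. Real systems pre-aggregate into histograms, but the idea can be sketched from raw samples with the standard library (the sample values below are made up):

```python
# Sketch: percentile latency from raw samples (hypothetical data).
from statistics import quantiles

latencies_ms = [12, 15, 14, 13, 200, 16, 15, 14, 500, 13]

# quantiles(..., n=100) returns the 99 percentile cut points 1..99
cuts = quantiles(latencies_ms, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"p50={p50:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms")
```

Note how two slow outliers barely move p50 but dominate p95/p99; that is why tail percentiles, not averages, are the latency signal to alert on.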
```python
# Using Prometheus with Python (via prometheus-client)
from prometheus_client import Counter, Histogram

REQUEST_LATENCY = Histogram("api_request_duration_seconds", "API request latency")
ERROR_COUNT = Counter("api_errors_total", "Total API errors", ["endpoint"])

@REQUEST_LATENCY.time()  # Automatically times the function
def process_order(order_id):
    try:
        ...
    except Exception:
        ERROR_COUNT.labels(endpoint="/orders").inc()
        raise  # Re-raise: count the error, but don't swallow it
```

3. Tracing — “The X-Ray”

In a microservices system, one user request might touch 7 services. If it takes 2 seconds, where did those 2 seconds go?

Tracing answers this with a Waterfall diagram:

```
Request: GET /checkout (2.0s total)
├── Auth Service (0.05s) ✅
├── Cart Service (0.08s) ✅
├── Inventory Service (1.6s) ← 🔴 THE BOTTLENECK
│   └── DB Query: SELECT ... (1.55s)  ← THE ACTUAL SLOW THING
└── Payment Service (0.27s) ✅
```
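Under the hood, a trace is just a collection of spans, each with a name and a duration, and "find the bottleneck" is a query over them. A toy sketch (span names and durations mirror the waterfall above; a real tracer would also record parent IDs and timestamps):

```python
# Toy sketch: given flat span data (name, duration_s), find the bottleneck.
spans = [
    ("Auth Service", 0.05),
    ("Cart Service", 0.08),
    ("Inventory Service", 1.60),
    ("Payment Service", 0.27),
]

bottleneck = max(spans, key=lambda s: s[1])
print(bottleneck)  # → ('Inventory Service', 1.6)
```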

OpenTelemetry is the open standard. Write it once; send to any backend (Jaeger, Datadog, etc.):

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def checkout(user_id):
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("user.id", user_id)

        with tracer.start_as_current_span("fetch_cart"):
            cart = get_cart(user_id)  # This span will be a child

        with tracer.start_as_current_span("process_payment"):
            charge(cart.total)  # This span will be a child
```
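How does nesting `with` blocks produce a parent/child tree? Each context manager pushes itself as the "current span" on entry and pops on exit, so any span opened inside it records the enclosing span as its parent. A toy illustration of that mechanism (this is NOT the OpenTelemetry SDK, just a sketch of the idea):

```python
# Toy illustration of how nested "current span" context managers
# build a parent/child tree. Not the real OpenTelemetry SDK.
from contextlib import contextmanager

_stack, recorded = [], []

@contextmanager
def span(name):
    parent = _stack[-1] if _stack else None  # enclosing span = parent
    _stack.append(name)
    try:
        yield
    finally:
        _stack.pop()
        recorded.append((name, parent))  # record on exit, like a real span

with span("checkout"):
    with span("fetch_cart"):
        pass
    with span("process_payment"):
        pass

print(recorded)
# → [('fetch_cart', 'checkout'), ('process_payment', 'checkout'), ('checkout', None)]
```

The real SDK does the same bookkeeping via context propagation, which is also how parentage survives across threads and even across service boundaries.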

Part 3: The Diagnosis (When to Use What)

| Problem | First Look At | Why |
| --- | --- | --- |
| "Error 500 for user 42" | Logs | Find the exact error message and stack trace. |
| "API is slow since 3 PM" | Metrics | Find the P95 latency spike on the dashboard. |
| "This request takes 2s" | Tracing | See the waterfall and find which service is slow. |
| "Server is down" | Metrics | CPU/Memory alerts triggered before the crash. |

Part 4: The Resolution (The Alerting Stack)

You cannot watch dashboards 24/7. You need alerts.

Good Alert: "P99 latency exceeded 2 seconds for 5 minutes." (Specific, actionable.)
Bad Alert: "CPU is above 50%." (Vague, fires all the time, causes alert fatigue.)

```yaml
# Prometheus alerting rule
groups:
- name: api_alerts
  rules:
  - alert: HighLatencyP99
    # Fire if 99th percentile latency > 2s for 5 minutes
    # (histogram_quantile needs a rate over the histogram buckets)
    expr: histogram_quantile(0.99, sum(rate(api_request_duration_seconds_bucket[5m])) by (le)) > 2
    for: 5m
    annotations:
      summary: "API P99 latency is critically high"
```
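The `for: 5m` clause is what prevents flapping: the condition must hold continuously for the whole duration before the alert fires. A sketch of that logic in plain Python (function name and sample values are made up; Prometheus evaluates this over its own scrape intervals):

```python
# Sketch of `for:` semantics: fire only after the condition has been
# true for N consecutive evaluations. Names/values are hypothetical.
def should_fire(samples, threshold, required_consecutive):
    """samples: one latency reading per evaluation interval."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= required_consecutive:
            return True
    return False

# With 1-minute evaluation intervals, `for: 5m` ≈ 5 consecutive breaches
print(should_fire([2.5, 2.6, 1.0, 2.5, 2.5, 2.5, 2.5], 2.0, 5))  # → False
print(should_fire([2.5, 2.6, 2.7, 2.5, 2.5], 2.0, 5))            # → True
```

A single dip back under the threshold resets the streak, so one transient spike never pages anyone.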

Final Mental Model

```
Logs    -> The Patient's Diary. "What happened? When? With what details?"
Metrics -> The Vital Signs. "How is the system trending over time?"
Tracing -> The X-Ray. "Where did THIS specific request go wrong?"
```

Use Logs to understand events.
Use Metrics to build dashboards and alerts.
Use Tracing to debug slow distributed transactions.

You need all three. Logs without Metrics is flying blind. Metrics without Tracing is knowing you’re sick but not knowing why.

Made with ~~laziness~~ love 🦥