Prompt Engineering: The 'Director & Actor' Mental Model

Two engineers. Same model. Same task. One gets brilliant results. The other gets garbage.

The difference is not the model. It’s the prompt.

Prompt Engineering is the skill of communicating precisely with an LLM. It is not a soft skill — it is a rigorous engineering discipline with measurable, reproducible techniques.

This is the Mastery Guide to Prompt Engineering.

Part 1: Foundations (The Mental Model)

The Director and the Actor

Think of the LLM as the world’s most versatile Actor. They can play any role, adopt any persona, follow any script.

You are the Director.

A bad director says: “Be good.” The actor has no idea what that means.

A great director says: “You are playing a 1920s detective. You are world-weary, sarcastic, but brilliant. You speak in short, punchy sentences. You never break character.”

The same actor delivers wildly different performances based on the direction. Your prompt is the direction.

Part 2: The Investigation (Core Techniques)

1. System Prompt (Setting the Stage)

The system prompt defines the actor’s permanent role and constraints for the entire conversation.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
from openai import OpenAI

client = OpenAI()

SYSTEM = """You are a senior Python code reviewer at a fintech company.
Your job:
- Review code for correctness, security, and performance.
- Be direct and specific. No praise unless it's exceptional.
- Format feedback as: [CRITICAL] / [WARNING] / [SUGGESTION]
- Limit: 5 most important points only."""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Review this code:\n```python\n{code}\n```"}
    ]
)

Rules for great system prompts:

Define the role (“You are a…”).
Define the output format explicitly.
Define constraints (“No more than 5 items”, “Never say X”).
Define the audience (“Explain to a non-technical manager”).

2. Few-Shot Learning (Show, Don’t Tell)

Instead of explaining what you want, show examples.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
FEW_SHOT_PROMPT = """Classify customer feedback as POSITIVE, NEGATIVE, or NEUTRAL.

Examples:
Feedback: "The delivery was super fast, love it!"
Classification: POSITIVE

Feedback: "Package arrived damaged, very disappointed."
Classification: NEGATIVE

Feedback: "Order received."
Classification: NEUTRAL

Now classify:
Feedback: "{feedback}"
Classification:"""

Few-shot is dramatically more reliable than zero-shot for classification, formatting, and extraction tasks.

3. Chain-of-Thought (Force Reasoning)

For complex reasoning tasks, tell the model to think step by step before answering.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
# ❌ ZERO-SHOT: Model jumps to an answer (often wrong)
prompt = "Is 17 a prime number? Answer: "

# ✅ CHAIN-OF-THOUGHT: Model reasons first
prompt = """Is 17 a prime number?
Let's think step by step:
1. A prime number has no divisors other than 1 and itself.
2. Check divisibility: 17 / 2 = 8.5 (no), 17 / 3 = 5.67 (no), 17 / 4 = 4.25 (no)
3. We only need to check up to √17 ≈ 4.1
4. No divisors found.
Answer: Yes, 17 is prime."""

For GPT-4o and Claude 3.5+, simply adding “Think step by step” or “Let’s reason through this” measurably improves accuracy on math, logic, and multi-step tasks.

4. Structured Output (Taming the Wild JSON)

Never parse LLM output with string manipulation. Use structured output.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class ReviewResult(BaseModel):
    sentiment: str          # "POSITIVE" | "NEGATIVE" | "NEUTRAL"
    confidence: float       # 0.0 to 1.0
    key_issues: list[str]   # Top 3 issues/praises

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a customer feedback analyzer."},
        {"role": "user", "content": f"Analyze: '{feedback}'"}
    ],
    response_format=ReviewResult  # Guaranteed valid JSON matching the schema
)

result = response.choices[0].message.parsed
print(result.sentiment)    # "NEGATIVE"
print(result.confidence)   # 0.95
print(result.key_issues)   # ["damaged package", "no refund offered"]

Part 3: The Diagnosis (Common Prompt Failures)

Problem	Cause	Fix
Inconsistent output format	No format specified	Add explicit format instructions + few-shot examples.
Model ignores constraints	Constraint buried at the end	Put critical constraints at the start of system prompt.
Hallucination on facts	Model guessing beyond its knowledge	Add: “If unsure, say ‘I don’t know’. Do not guess.”
Too verbose	No length constraint	Add: “Answer in 3 sentences maximum.”
Prompt injection	User overrides your instructions	Wrap user input: `[USER INPUT START]\n{input}\n[USER INPUT END]`

Part 4: The Resolution (Production Prompt Patterns)

The COSTAR Framework

Structure important prompts with 6 elements:

1
2
3
4
5
6
C — Context:   "You are helping a junior developer debug Python errors."
O — Objective: "Identify the root cause of the error."
S — Style:     "Explain like I'm 5 for concepts, technical for code."
T — Tone:      "Patient, encouraging, never condescending."
A — Audience:  "A developer with 1 year of experience."
R — Response:  "Format: [Root Cause] → [Explanation] → [Fix Code]"

Prompt Versioning (Treat Prompts Like Code)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
# prompts/v2/code_reviewer.py
SYSTEM_PROMPT_V2 = """..."""  # Versioned, tested, tracked in Git

# Evaluate prompt changes with evals:
def eval_prompt(prompt: str, test_cases: list) -> float:
    correct = sum(
        run_prompt(prompt, tc["input"]) == tc["expected"]
        for tc in test_cases
    )
    return correct / len(test_cases)

score_v1 = eval_prompt(SYSTEM_PROMPT_V1, test_cases)  # 0.72
score_v2 = eval_prompt(SYSTEM_PROMPT_V2, test_cases)  # 0.89

Final Mental Model

1
2
3
4
5
6
7
8
System Prompt        → The Director's brief. Sets the permanent role.
Few-Shot Examples    → Show, don't tell. 3 examples beat 3 paragraphs of instructions.
Chain-of-Thought     → "Think step by step." Forces reasoning before concluding.
Structured Output    → Pydantic schema = guaranteed valid JSON. No parsing hacks.

temperature=0   → Deterministic. For factual tasks and evals.
temperature=0.7 → Creative. For content generation.
top_p=0.1       → Narrow probability. Only the most likely tokens.

Treat prompts like code: version them, test them with evals, review them in PRs, and never ship a prompt change to production without a benchmark.