“Our AI keeps responding in English instead of Vietnamese. Should we fine-tune?”
“Our AI doesn’t know about our new product launched last week. Should we fine-tune?”
These two problems almost never have the same answer. Yet engineers conflate them constantly, and one wrong choice costs weeks and thousands of dollars.
Part 1: Foundations (The Mental Model)
The Medical School Analogy
RAG = Giving a doctor a reference book before every patient visit.
- “Here’s relevant information for this patient. Now diagnose.”
- The doctor’s underlying medical knowledge is unchanged.
- Perfect for: current information, company-specific data.
Fine-tuning = Sending the doctor to actual medical school.
- The doctor’s brain is re-trained. They internalize new knowledge and behaviors.
- Perfect for: changing how the model behaves, speaks, and reasons.
|           | RAG                     | Fine-tuning                         |
|-----------|-------------------------|-------------------------------------|
| Use when  | "Model lacks knowledge" | "Model lacks skill/style"           |
| Cost      | Low (just indexing)     | High ($$ GPU hours)                 |
| Updatable | Instantly (re-index)    | Hard (retrain)                      |
| Example   | "Know our FAQ"          | "Always respond in our brand voice" |
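The "reference book" half of the analogy is just retrieval plus prompt assembly. A minimal sketch of the RAG side, using naive keyword overlap as a stand-in for a real vector index (function names and documents are illustrative):

```python
import re

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank docs by word overlap with the query (a toy stand-in for vector search)."""
    tokens = lambda s: set(re.findall(r"\w+", s.lower()))
    q = tokens(query)
    return sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Assemble the 'reference book' context the model sees before answering."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Use only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Our new product, Widget X, launched last week.",
    "The refund window is 30 days.",
    "Support hours are 9am-5pm weekdays.",
]
print(build_prompt("What product launched last week?", docs))
```

Note what updating knowledge costs here: append a document and re-rank. No weights change, which is exactly why RAG wins on freshness.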
Part 2: The Investigation (When Fine-Tuning Wins)
Fine-tuning changes the model’s weights — its fundamental behavior. Use it when you need:
- Consistent format/style: “Always respond as bullet points in markdown.”
- Domain language: Medical jargon, legal language, code in a specific style.
- Task specialization: A model that only does SQL generation, fast and reliably.
- Language/dialect: Teaching a model to write natural Vietnamese (not translated-sounding).
When RAG is enough (use this first, always):
- The model just needs up-to-date facts it doesn’t know.
- You need to cite sources in your answer.
- Data changes frequently (product catalog, pricing).
Part 3: The Diagnosis (LoRA — Fine-tuning Without a Supercomputer)
Full fine-tuning updates ALL of a model’s billions of parameters. Prohibitively expensive.
LoRA (Low-Rank Adaptation) is the breakthrough that made fine-tuning accessible. Instead of updating all weights, it adds small adapter matrices to key layers. Only the adapters are trained (~1% of parameters).
```
Full fine-tuning: update 7 billion parameters    → needs 80GB GPU × 4 days
LoRA fine-tuning: update ~70M adapter parameters → needs 16GB GPU × 2 hours
```
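To see why adapters are so much cheaper, count parameters. LoRA replaces an update to a `d_out × d_in` weight matrix `W` with two thin matrices `B` (`d_out × r`) and `A` (`r × d_in`), computing `W·x + (alpha/r)·B·A·x` and training only `B` and `A`. A back-of-envelope sketch (dimensions are illustrative, loosely Llama-sized):

```python
def lora_params(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (full, adapter) trainable-parameter counts for one weight matrix."""
    full = d_in * d_out            # updating W directly
    adapter = r * (d_in + d_out)   # training only B (d_out x r) and A (r x d_in)
    return full, adapter

# One 4096x4096 attention projection, LoRA rank 16
full, adapter = lora_params(4096, 4096, 16)
print(full, adapter, f"{adapter / full:.2%}")  # adapter is under 1% of the full update
```

Repeat that across every adapted layer and you get the ~1%-of-parameters figure above: the rank `r` is tiny compared to the layer dimensions, so the adapter term grows linearly where the full matrix grows quadratically.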
Fine-tuning with Unsloth + LoRA (Python)
```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from datasets import Dataset

# Load the base model with 4-bit quantization (fits on a single consumer GPU)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b",
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit quantization: an 8B model fits in ~6GB VRAM
)

# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                 # LoRA rank: higher = more capacity, more params
    lora_alpha=16,                        # scaling factor for the adapter output
    target_modules=["q_proj", "v_proj"],  # which attention layers to adapt
)

# Your training data (instruction → response pairs)
data = Dataset.from_list([
    {"text": f"### Instruction:\n{ex['input']}\n\n### Response:\n{ex['output']}"}
    for ex in your_training_data  # your list of {"input": ..., "output": ...} dicts
])

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=data,
    dataset_text_field="text",
    max_seq_length=2048,
)
trainer.train()

# Save only the adapters (small: ~50MB vs ~16GB for the full model)
model.save_pretrained("my-lora-adapter")
```
Part 4: The Resolution (Decision Framework)
```
Problem: "AI doesn't know X"
│
├── X changes frequently?          → RAG (re-index = done)
│
├── X is private/proprietary docs? → RAG
│
└── X is a skill/behavior/style?   → Fine-tune
    │
    ├── Budget < $100?   → LoRA on an open model (Llama 3, Mistral)
    │
    └── Budget flexible? → OpenAI fine-tuning API (pay per token)
```
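The tree above is simple enough to encode as a triage function, which makes the precedence explicit: RAG-shaped problems are checked first, and fine-tuning is only reached for skill/behavior/style gaps (names and the $100 threshold are illustrative):

```python
def choose_approach(changes_frequently: bool, is_private_docs: bool,
                    is_skill_or_style: bool, budget_usd: float = 0.0) -> str:
    """Encode the decision tree: RAG-shaped problems win before fine-tuning is considered."""
    if changes_frequently or is_private_docs:
        return "RAG"
    if is_skill_or_style:
        return "LoRA on an open model" if budget_usd < 100 else "OpenAI fine-tuning API"
    return "RAG"  # default: start with retrieval, escalate only if it fails

print(choose_approach(changes_frequently=True, is_private_docs=False,
                      is_skill_or_style=False))  # → RAG
```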
OpenAI Fine-Tuning API (Managed, No GPU)
```python
from openai import OpenAI
import json

client = OpenAI()

# 1. Upload training data (JSONL format, minimum 10 examples)
with open("training.jsonl", "w") as f:
    for ex in training_data:
        f.write(json.dumps({
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": ex["question"]},
                {"role": "assistant", "content": ex["answer"]},
            ]
        }) + "\n")

file = client.files.create(file=open("training.jsonl", "rb"), purpose="fine-tune")

# 2. Start the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini",  # fine-tune the smaller model (cheaper)
)

# 3. Wait for training to finish — fine_tuned_model stays None
#    until the job status is "succeeded", so poll before using it
job = client.fine_tuning.jobs.retrieve(job.id)

# 4. Use your fine-tuned model
response = client.chat.completions.create(
    model=job.fine_tuned_model,  # e.g. "ft:gpt-4o-mini:acme:v1:abc123"
    messages=[{"role": "user", "content": "..."}],
)
```
Final Mental Model
```
RAG         → Give the doctor a reference book. Instant. Citable. Updatable.
Fine-tuning → Send the doctor to med school. Permanent. Expensive. Powerful.
LoRA        → Surgically add adapter layers. Train 1% of params, 90% of the effect.
Full FT     → Retrain the entire brain. 100x more expensive. Rarely necessary.

Start with RAG. Fine-tune only when RAG can't fix it.
```
The 2026 rule: 90% of AI product problems are solved by better prompts + RAG. Fine-tune when you’ve exhausted both. LoRA when you fine-tune.