Engineers mix up partitioning, bucketing, sharding, and replication constantly.
Edge computing is usually not confused with those terms, but it often gets incorrectly proposed as a substitute for them (e.g., “let’s use edge instead of sharding”).
These concepts live at different layers and solve different bottlenecks.
This post gives you a clean mental model and practical examples so you can choose the right tool.
The One Diagram to Remember
Everything in this post reduces to this hierarchy:
- Inside one database: partitioning and bucketing (performance)
- Across multiple databases: sharding (capacity and writes) and replication (availability and reads)
- Outside the database entirely: edge computing (latency)
Quick Definitions (one sentence each)
- Partitioning: split a table into logical segments inside one database to reduce scan and maintenance cost.
- Bucketing: hash rows into fixed groups inside one database to improve join and parallel scan efficiency.
- Sharding: split data across multiple database instances to scale storage and write throughput.
- Replication: copy the same data to multiple nodes to improve availability and read scalability.
- Edge computing: run compute closer to users to reduce latency.
A More Visual Real-Life Model: Shipping Warehouses
Imagine an online store that ships packages.
- A database instance is one warehouse.
- A database cluster is multiple warehouses.
- Users are customers around the world.
Now map the terms:
- Partitioning: split packages into sections like “January”, “February”, “Returns”, so workers search less.
- Bucketing: inside each section, assign packages into fixed lanes (Lane 0..31) by hashing the order id.
- Sharding: open more warehouses; each warehouse owns a subset of orders.
- Replication: keep copies of the same inventory data in multiple warehouses for failover and read load.
- Edge computing: place small local depots near customers to serve common requests faster (mostly caching + routing).
If you prefer a verb memory trick:
- Partitioning: split by value
- Bucketing: split by hash
- Sharding: split across warehouses
- Replication: copy to more warehouses
- Edge: move compute closer to customers
The 30-Second Decision Tree
When someone says “we need sharding”, run this checklist:
1. Is one big table slow to scan? Try indexes first, then partitioning.
2. Are joins between large tables the bottleneck? Try bucketing.
3. Are reads overwhelming one server? Try replication.
4. Is user-perceived latency the complaint? Try caching and edge.
5. Do writes or total storage genuinely exceed one machine? Only then shard.
Layer 1: Partitioning (split by value)
Partitioning divides one logical table into smaller logical segments inside the same database.
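To make this concrete, here is a minimal Python sketch that emulates monthly value-based partitioning on top of sqlite3 (which has no native partitioning); the table names like orders_2024_01 and the date format are illustrative. Real engines such as PostgreSQL do the routing and pruning for you with declarative `PARTITION BY` syntax.

```python
import sqlite3

def partition_for(order_date: str) -> str:
    """Route a row to its monthly segment by value, e.g. '2024-01-15' -> 'orders_2024_01'."""
    year, month, _ = order_date.split("-")
    return f"orders_{year}_{month}"

conn = sqlite3.connect(":memory:")
for month in ("01", "02"):
    conn.execute(f"CREATE TABLE orders_2024_{month} (id INTEGER, order_date TEXT)")

# Writes are routed by value: each row lands in exactly one segment.
for row in [(1, "2024-01-15"), (2, "2024-02-03"), (3, "2024-01-20")]:
    conn.execute(f"INSERT INTO {partition_for(row[1])} VALUES (?, ?)", row)

# Partition pruning: a query about January touches one small table, not every row.
january = conn.execute("SELECT COUNT(*) FROM orders_2024_01").fetchone()[0]
print(january)  # 2
```

The point of the sketch: the query for January never reads February's rows at all, which is exactly what partition pruning buys you.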
Why partitioning exists:
- Faster queries via partition pruning
- Faster deletes and archival
- Smaller indexes
- Easier maintenance
What partitioning does NOT do:
- Does not increase total storage capacity
- Does not increase write throughput beyond one database
Layer 2: Bucketing (hash distribution within one database)
Bucketing distributes rows into a fixed number of buckets using a hash function.
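A minimal sketch of the bucketing logic in Python; the 32-bucket count and the order-id format are assumptions, and zlib.crc32 stands in for the engine's stable hash (Python's built-in hash() is randomized per process for strings, so it would not give stable buckets).

```python
import zlib

NUM_BUCKETS = 32  # fixed when the table is created; never changes per query

def bucket_of(order_id: str) -> int:
    """Stable hash -> fixed bucket; the same key always lands in the same bucket."""
    return zlib.crc32(order_id.encode()) % NUM_BUCKETS

# Distribute some rows into their buckets.
buckets = {i: [] for i in range(NUM_BUCKETS)}
for order_id in ("A-1001", "A-1002", "B-2001"):
    buckets[bucket_of(order_id)].append(order_id)
```

Determinism is the whole trick: because `bucket_of` is pure, two tables bucketed by the same key agree on where every id lives.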
Why bucketing exists:
- Faster joins
- Better parallelism
- Reduced scan work
The Magic of Joins & Parallelism
Bucketing shines when you have two large tables (e.g., orders and order_items) both bucketed by the same key (order_id) into the same number of buckets.
1. Co-located Joins
In a distributed system, joining two large tables often triggers a Shuffle Join. The engine has to move millions of rows across the network (like sending massive trucks between warehouses) so that matching IDs end up on the same machine. This network “traffic jam” is a silent killer of performance.
With Bucketing, you perform a Bucket-to-Bucket Join. Since the engine knows that orders in Bucket 0 can only match order_items in Bucket 0, and both are already sitting on the same disk, it avoids the shuffle entirely. No trucks, no traffic jams.
2. Independent Parallelism
Because buckets are independent, you can assign 32 workers to join 32 pairs of buckets simultaneously. No worker needs to talk to another.
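The bucket-to-bucket join above can be sketched in Python; the tiny tables and the 4-bucket count are illustrative, and ThreadPoolExecutor stands in for the engine's independent workers.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_BUCKETS = 4  # small for illustration; 32 in the warehouse analogy

def bucket_of(order_id: int) -> int:
    return zlib.crc32(str(order_id).encode()) % NUM_BUCKETS

# Both tables are bucketed by the same key into the same number of buckets.
orders = [(1, "alice"), (2, "bob"), (3, "carol")]        # (order_id, customer)
items = [(1, "book"), (1, "pen"), (3, "mug")]            # (order_id, item)

def to_buckets(rows):
    out = {b: [] for b in range(NUM_BUCKETS)}
    for row in rows:
        out[bucket_of(row[0])].append(row)
    return out

order_b, item_b = to_buckets(orders), to_buckets(items)

def join_bucket(b):
    # Bucket b of orders can only match bucket b of items: no shuffle needed.
    by_id = {oid: who for oid, who in order_b[b]}
    return [(oid, by_id[oid], item) for oid, item in item_b[b] if oid in by_id]

# Each worker joins one bucket pair; no worker talks to another.
with ThreadPoolExecutor(max_workers=NUM_BUCKETS) as pool:
    joined = [row for part in pool.map(join_bucket, range(NUM_BUCKETS)) for row in part]

print(sorted(joined))  # [(1, 'alice', 'book'), (1, 'alice', 'pen'), (3, 'carol', 'mug')]
```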
Important constraint:
All buckets still live inside the same database or storage system.
Bucketing improves performance, not capacity.
Partitioning vs Bucketing
Partitioning = split by value. Example: rows with an order_date in January land in the January segment; February rows land in the February segment.
Bucketing = split by hash. Example: bucket = hash(order_id) % 32, so each row lands in one of 32 fixed buckets regardless of its value.
In practice: partitioning reduces scan scope; bucketing improves join and parallel processing efficiency.
They are often used together.
Layer 3: Sharding (distribution across multiple databases)
Sharding distributes data across multiple database instances.
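A minimal routing sketch in Python; the shard hostnames are hypothetical, and zlib.crc32 stands in for whatever stable hash your router uses.

```python
import zlib

# Hypothetical connection targets: one per independent database instance.
SHARDS = ["db-shard-0.internal", "db-shard-1.internal", "db-shard-2.internal"]

def shard_for(user_id: str) -> str:
    """Each key maps to exactly one shard; that shard alone owns the row."""
    return SHARDS[zlib.crc32(user_id.encode()) % len(SHARDS)]
```

Every read and write for a key must go through this function, which is why the routing has to be deterministic, and why changing the shard count later (rebalancing) is painful.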
Why sharding exists:
- Increase total storage capacity
- Increase write throughput
- Increase connection capacity
Tradeoffs:
- Cross-shard queries are complex
- Transactions are harder
- Rebalancing is difficult
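To see why cross-shard queries are the first tradeoff, here is a scatter-gather sketch in Python, with plain in-memory lists standing in for separate database instances; the user ids and amounts are made up.

```python
# Each "shard" is a list standing in for a separate database instance.
shards = [
    [("u-1", 120), ("u-4", 80)],
    [("u-2", 300)],
    [("u-3", 50), ("u-5", 210)],
]

def total_revenue() -> int:
    """A global aggregate cannot run on one shard: it must be scattered to
    every shard, computed partially, and gathered back together."""
    partials = [sum(amount for _, amount in shard) for shard in shards]
    return sum(partials)

print(total_revenue())  # 760
```

A single-key lookup hits one shard and stays fast; anything global (aggregates, joins across keys, transactions spanning two users) pays this scatter-gather tax.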
Bucketing vs Sharding (critical distinction)
Both use hash distribution.
Difference is scope.
Bucketing: hash(key) % N picks a bucket inside one database; every bucket shares that machine's disk, memory, and write path.
Sharding: hash(key) % N picks a separate database instance; each shard brings its own disk, memory, and write path.
Bucketing improves performance inside one database.
Sharding increases capacity across multiple databases.
Layer 4: Replication (copy data)
Replication creates identical copies of data across nodes.
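A minimal read/write routing sketch in Python; the hostnames are hypothetical, and the statement sniffing is deliberately naive.

```python
import itertools

# Hypothetical hostnames: one primary takes all writes, replicas serve reads.
PRIMARY = "db-primary.internal"
REPLICAS = ["db-replica-0.internal", "db-replica-1.internal"]
_replica_cycle = itertools.cycle(REPLICAS)

def route(statement: str) -> str:
    """Writes must hit the primary; reads round-robin across replicas."""
    if statement.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
        return PRIMARY
    return next(_replica_cycle)

print(route("INSERT INTO orders VALUES (1)"))  # db-primary.internal
```

Notice that adding replicas grows the read pool but the write path is still one machine, which is exactly why replication does not increase write capacity.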
Why replication exists:
- High availability
- Failover
- Read scaling
Replication does not increase write capacity.
Sharding vs Replication (the data distinction)
Both involve multiple servers, but they handle data differently.
Replication: every server has all the data (clones).
- Goal: Scaling reads and high availability.
- Limit: You are still limited by the storage capacity of one single machine.
Sharding: every server has a piece of the data (splits).
- Goal: Scaling writes and total storage capacity.
- Limit: Complex cross-shard queries and transactions.
Layer 5: Edge computing (move compute closer to users)
Edge computing moves compute to locations closer to users.
Typical flow: user → nearest edge location (serve from cache on a hit) → origin application and database only on a miss.
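A toy Python sketch of the caching half of that flow; the paths, values, and TTL are made up, and a dict stands in for one edge location's cache.

```python
import time

ORIGIN = {"/products/1": "Blue Mug"}  # stands in for the far-away origin database
edge_cache = {}                        # one edge location's cache: path -> (value, expiry)
TTL_SECONDS = 60.0

def serve(path: str) -> str:
    """Serve from the nearby edge cache when fresh; otherwise pay the origin round trip."""
    hit = edge_cache.get(path)
    if hit and hit[1] > time.monotonic():
        return hit[0]                              # fast: answered at the edge
    value = ORIGIN[path]                           # slow: full trip to the origin
    edge_cache[path] = (value, time.monotonic() + TTL_SECONDS)
    return value

serve("/products/1")         # first request: miss, goes to origin
print(serve("/products/1"))  # second request: hit, served at the edge
```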
Edge improves latency and user experience, typically via caching and request routing.
Edge does not change database capacity.
How Real Systems Evolve
Typical progression:
1. Indexes and query optimization
2. Partitioning (when tables grow)
3. Replication (when reads dominate)
4. Caching and edge (when latency and bandwidth matter)
5. Sharding (only when one primary cannot keep up)
Most systems never need step 5.
Final Mental Model
- Partitioning → slow scans over one big table
- Bucketing → slow joins and poor parallelism
- Replication → read load and availability
- Sharding → write throughput and storage beyond one machine
- Edge → user-perceived latency
Each term maps to a different bottleneck.
The fastest way to lose a design review is to use the right word for the wrong layer.
