Brian Tsan
UC Merced
btsan@ucmerced.edu
Advisor: Professor Florin Rusu
SELECT COUNT(*) FROM customers c JOIN orders o ON c.customer_id = o.customer_id
WHERE c.age = 35 AND c.city = 'SF'
| Input Embeddings | |
|---|---|
| age | h(35) = 7 |
| city | h('SF') = 12 |
| customer_id | [MASK] |
the background
Find all students enrolled in special seminar courses
SELECT s.name, e.student_id, c.course_name FROM Students s JOIN Enrollments e ON s.student_id = e.student_id JOIN SpecialCourses c ON e.course_id = c.course_id
| ID | Name |
|---|---|
| 1 | Alice |
| 2 | Bob |
| 3 | Carol |
| 4 | Dave |
| 5 | Eve |
| SID | CID |
|---|---|
| 1 | 101 |
| 2 | 101 |
| 1 | 201 |
| 3 | 201 |
| 4 | 202 |
| 5 | 202 |
| 2 | 203 |
| 3 | 203 |
| 4 | 204 |
| 5 | 205 |
| ID | Course |
|---|---|
| 101 | CS101 |
| 201 | CS102 |
Step 1: Students ⋈ Enrollments
| ID | Name |
|---|---|
| 1 | Alice |
| 2 | Bob |
| 3 | Carol |
| 4 | Dave |
| 5 | Eve |
| SID | CID |
|---|---|
| 1 | 101 |
| 2 | 101 |
| 1 | 201 |
| 3 | 201 |
| 4 | 202 |
| 5 | 202 |
| 2 | 203 |
| 3 | 203 |
| 4 | 204 |
| 5 | 205 |
Cost: 5 × 10 = 50 comparisons
Step 1: Students ⋈ Enrollments
| ID | Name |
|---|---|
| 1 | Alice |
| 2 | Bob |
| 3 | Carol |
| 4 | Dave |
| 5 | Eve |
| SID | CID |
|---|---|
| 1 | 101 |
| 2 | 101 |
| 1 | 201 |
| 3 | 201 |
| 4 | 202 |
| 5 | 202 |
| 2 | 203 |
| 3 | 203 |
| 4 | 204 |
| 5 | 205 |
Cost: 5 × 10 = 50 comparisons
Step 1: Students ⋈ Enrollments
| ID | Name |
|---|---|
| 1 | Alice |
| 2 | Bob |
| 3 | Carol |
| 4 | Dave |
| 5 | Eve |
| SID | CID |
|---|---|
| 1 | 101 |
| 2 | 101 |
| 1 | 201 |
| 3 | 201 |
| 4 | 202 |
| 5 | 202 |
| 2 | 203 |
| 3 | 203 |
| 4 | 204 |
| 5 | 205 |
Cost: 5 × 10 = 50 comparisons
Step 1 Result: 10 intermediate rows
| Name | SID | CID |
|---|---|---|
| Alice | 1 | 101 |
| Bob | 2 | 101 |
| Alice | 1 | 201 |
| Carol | 3 | 201 |
| Dave | 4 | 202 |
| Eve | 5 | 202 |
| Bob | 2 | 203 |
| Carol | 3 | 203 |
| Dave | 4 | 204 |
| Eve | 5 | 205 |
Step 2: Result ⋈ SpecialCourses
| Name | CID |
|---|---|
| Alice | 101 |
| Bob | 101 |
| Alice | 201 |
| Carol | 201 |
| Dave | 202 |
| Eve | 202 |
| Bob | 203 |
| Carol | 203 |
| Dave | 204 |
| Eve | 205 |
| ID | Course |
|---|---|
| 101 | CS101 |
| 201 | CS102 |
Cost: 10 × 2 = 20 comparisons
Step 2: Result ⋈ SpecialCourses
| Name | CID |
|---|---|
| Alice | 101 |
| Bob | 101 |
| Alice | 201 |
| Carol | 201 |
| Dave | 202 |
| Eve | 202 |
| Bob | 203 |
| Carol | 203 |
| Dave | 204 |
| Eve | 205 |
| ID | Course |
|---|---|
| 101 | CS101 |
| 201 | CS102 |
Cost: 10 × 2 = 20 comparisons
Final Result: 4 rows
| Name | Course |
|---|---|
| Alice | CS101 |
| Bob | CS101 |
| Alice | CS102 |
| Carol | CS102 |
Step 1: SpecialCourses ⋈ Enrollments
| ID | Course |
|---|---|
| 101 | CS101 |
| 201 | CS102 |
| SID | CID |
|---|---|
| 1 | 101 |
| 2 | 101 |
| 1 | 201 |
| 3 | 201 |
| 4 | 202 |
| 5 | 202 |
| 2 | 203 |
| 3 | 203 |
| 4 | 204 |
| 5 | 205 |
Cost: 2 × 10 = 20 comparisons
Step 1: SpecialCourses ⋈ Enrollments
| ID | Course |
|---|---|
| 101 | CS101 |
| 201 | CS102 |
| SID | CID |
|---|---|
| 1 | 101 |
| 2 | 101 |
| 1 | 201 |
| 3 | 201 |
| 4 | 202 |
| 5 | 202 |
| 2 | 203 |
| 3 | 203 |
| 4 | 204 |
| 5 | 205 |
Cost: 2 × 10 = 20 comparisons
Step 1: SpecialCourses ⋈ Enrollments
| ID | Course |
|---|---|
| 101 | CS101 |
| 201 | CS102 |
| SID | CID |
|---|---|
| 1 | 101 |
| 2 | 101 |
| 1 | 201 |
| 3 | 201 |
| 4 | 202 |
| 5 | 202 |
| 2 | 203 |
| 3 | 203 |
| 4 | 204 |
| 5 | 205 |
Cost: 2 × 10 = 20 comparisons
Step 1 Result: Only 4 intermediate rows!
| SID | Course |
|---|---|
| 1 | CS101 |
| 2 | CS101 |
| 1 | CS102 |
| 3 | CS102 |
Step 2: Result ⋈ Students
| SID | Course |
|---|---|
| 1 | CS101 |
| 2 | CS101 |
| 1 | CS102 |
| 3 | CS102 |
| ID | Name |
|---|---|
| 1 | Alice |
| 2 | Bob |
| 3 | Carol |
| 4 | Dave |
| 5 | Eve |
Cost: 4 × 5 = 10 comparisons
Much smaller intermediate result!
Step 2: Result ⋈ Students
| SID | Course |
|---|---|
| 1 | CS101 |
| 2 | CS101 |
| 1 | CS102 |
| 3 | CS102 |
| ID | Name |
|---|---|
| 1 | Alice |
| 2 | Bob |
| 3 | Carol |
| 4 | Dave |
| 5 | Eve |
Cost: 4 × 5 = 10 comparisons
Much smaller intermediate result!
Final Result: 4 rows
| Name | Course |
|---|---|
| Alice | CS101 |
| Bob | CS101 |
| Alice | CS102 |
| Carol | CS102 |
Plan B is 2.3× faster
The Challenge: How can the query optimizer choose the best join order before execution?
→ We need cardinality estimation to predict intermediate result sizes!
Example: Wrong estimate leads to wrong join order
10x slower query execution!
Traditional estimators assume columns are independent
Reality: Data is often correlated!
This assumption fails → Massive estimation errors
Sketch: Compact probabilistic data structure
But... they're query-specific 😕
AGMS Sketch: Elements compete in a "tug-of-war"
Each color assigned to +1 or -1 team
Hash function: $\xi: \mathbb{R} \rightarrow \{\pm 1\}$
The reason tug-of-war works:
$\mathbb{E}[\xi(i)\xi(j)] = \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}$
To estimate frequency of value $j$:
$f_j \approx s \cdot \xi(j)$
where $s = \sum_{i \in X} \xi(i)$ is the sketch
Unbiased! $\mathbb{E}[s \cdot \xi(j)] = f_j$
Variance: $\text{Var}[s \cdot \xi(j)] = F_2(X) - f_j^2$
AGMS has high variance...
so you need to play a lot of tug-of-war!
Use width-$w$ array instead of single counter
Each element goes to ONE counter → 1/w update time!
How to handle range predicates?
Decompose range into dyadic intervals!
Any range decomposes into ≤ $2(\log_2(b-a) - 1)$ intervals
For join $X \bowtie Y$:
$|X \bowtie Y| \approx S_X \cdot S_Y$
Dot product of sketch vectors!
This generalizes to sktech tensors for more joins
But tensor size grows exponentially...
Avoiding Tensors with Circular Convolution
Simplified: Always increment (no $\xi$ hash)
Trade-off: Biased (overestimates), but simpler & faster
Track maximum frequency per bucket
$\max_i f_i \leq \sum_i f_i$ → Tighter than Count-Min!
| Sketch | Linear? | Unbiased? | Space | Update Time |
|---|---|---|---|---|
| AGMS | ✅ Yes | ✅ Yes | O(dw) | O(dw) |
| Fast-AGMS | ✅ Yes | ✅ Yes | O(dw) | O(2d) |
| Count-Min | ✅ Yes | ❌ No (upper bound) | O(dw) | O(d) |
| Bound | ❌ No | ❌ No (tighter bound) | O(dw) | O(d) |
d = # independent sketches, w = sketch width
Issue 1: Selections + Joins
→ Would need exponential sketch tensors
Issue 2: Query-specific construction
→ Need to scan entire dataset for each query
Issue 3: Many queries = many sketches
→ Storage explosion for diverse workloads
Idea: Use ML models to generate sketches on-demand
No data scan required! ✨
Both approaches: Learn distribution → Generate sketches
Example Query:
SELECT COUNT(*)
FROM orders o JOIN customers c ON o.customer_id = c.id
WHERE o.year = 2024 AND c.region = 'West'
Need sketches for σyear=2024(orders) and σregion='West'(customers)
The exact sketching pipeline (top) is impractical.
Using Transformers to Approximate Count-Min Sketches
A Count-Min sketch can be expressed as:
Where:
Idea: Train a model to estimate p for ANY predicate φ
For a hash function h : ℝ → {1, ..., w}:
The i-th counter in the sketch:
Therefore: sketch = p × |σφ(T)|
SELECT COUNT(*) FROM customers c JOIN orders o ON c.customer_id = o.customer_id
WHERE c.age = 35 AND c.city = 'SF'
| Input Embeddings | |
|---|---|
| age | h(35) = 7 |
| city | h('SF') = 12 |
| customer_id | [MASK] |
Essential for approximating sketches with arbitrary filter conditions
Start with actual database table rows
| customer_id | age (point pred) |
city (point pred) |
year (range pred) |
|---|---|---|---|
| 1001 | 35 | 'SF' | 2022 |
| 1002 | 42 | 'NY' | 2019 |
| 1003 | 28 | 'LA' | 2021 |
| 1004 | 55 | 'SF' | 2023 |
Goal: Transform this data to train BERT to predict sketches for any filter condition
| Stage | Point Predicate (e.g., city) | Range Predicate (e.g., year) |
|---|---|---|
| 1. Raw Value | 'SF' | 2022 |
| 2. Dyadic Annotation | — | [2022,2023)2⁰, [2022,2024)2¹, ... |
| 3. Hash | h('SF') = 12 | h0 = 45, h1 = 28, ... |
| 4. Embedding | Map each hash to learnable vector in ℝd | |
| 5. Training | BERT masked language modeling: [h₁, h₂, [MASK], h₄] → predict h₃ | |
For query: SELECT * FROM A JOIN B ON A.x = B.y WHERE ...
SELECT COUNT(*) FROM customers c JOIN orders o ON c.customer_id = o.customer_id WHERE c.age = 35 AND c.city = 'SF' AND o.year = 2023
Datasets:
Baselines:
Approx. faster than exact (pushdown) sketches!
Scaling Beyond Transformer Limitations
Trees made of:
Training left, inference right
Query for
the distribution of $P(X, Z)$ where $X=x_2$
Decompose complex joint distributions as linear combinations:
Advantages for Relational Data:
Instead of storing probability distributions in leaf nodes...
Store sketches in leaf nodes!
Conjecture: Approximation error is bounded by independence assumption
Worst-case: Single product node (full independence assumption)
Tighter independence/clustering thresholds → Closer to exact sketches
Observation: Overestimation is safer
Two Techniques for Upward Bias:
Datasets:
Baselines:
Takeaway: Just cluster, even a little bit, to drastically reduce error!
| Metric | Exact Sketches | Approximate Sketches |
|---|---|---|
| Sketch Construction Time | 737s (JOB-light) 155s (Stats-CEB) |
4.5s (JOB-light) 17.3s (Stats-CEB) |
| Space Required | 473 MB (JOB-light) 3.1 GB (Stats-CEB) |
1.2 GB model 676 MB model |
| Query Execution Time | 2.88 hrs (JOB-light) 9.16 hrs (Stats-CEB) |
2.88 hrs (0!) 9.31 hrs (+0.15) |
Within 3% of exact sketches! But much faster to compute.
| Method | Model | JOB-light | Stats-CEB |
|---|---|---|---|
| Approximate Sketches | BERT | 148 min/epoch* | N/A |
| Sketched SPNs | SPN | 21 min | 2 min |
| DeepDB | SPN | 30 min | 55 min |
| BayesCard | Bayes Net | 6 min | 2 min |
| FactorJoin | Bayes Net | 4 min | 1 min |
*GPU required; others CPU-only
Sketched SPNs competitive with other per-relation methods
Exact F-AGMS would be the fastest, if not for construction time!
Makes sketches practical
1. Problem: Cardinality estimation is critical
2. Challenge: Joins are hard to estimate
3. Sketches: Fast, accurate, but limited
4. Our Solution: Learn distribution → Generate sketches
5. Results: Fast generation, high accuracy
Enables sketches for query optimization
without per-query construction!
Brian Tsan
btsan@ucmerced.edu
Advisor: Professor Florin Rusu
UC Merced
Code & demos available at: github.com/btsan