SQL Aggregation & Filtering
What You'll Learn
By the end of this page, you will be able to:
- Use aggregate functions (
COUNT,SUM,AVG,MIN,MAX) to summarize data - Write GROUP BY queries to produce subtotals per category
- Explain the critical difference between WHERE and HAVING
- Use DISTINCT and understand when it differs from GROUP BY
- Write queries with multiple grouping levels and filtered aggregations
- Avoid common pitfalls with NULLs, empty groups, and incorrect filtering
The Problem: Summarizing Data
Raw data lives in rows — one row per event, per order, per employee. But business questions ask about groups, not individual rows:
- "What is the average salary per department?"
- "How many orders did each customer place last month?"
- "Which departments have more than 5 employees?"
You need a way to collapse many rows into summary rows. That's what aggregation does.
Sample Tables
All examples use these tables:
CREATE TABLE departments (
dept_id INT PRIMARY KEY,
dept_name VARCHAR(50)
);
INSERT INTO departments VALUES
(10, 'Engineering'),
(20, 'Marketing'),
(30, 'Sales'),
(40, 'HR');
CREATE TABLE employees (
id INT PRIMARY KEY,
name VARCHAR(50),
dept_id INT REFERENCES departments(dept_id),
salary DECIMAL(10,2),
hire_date DATE
);
INSERT INTO employees VALUES
(1, 'Alice', 10, 90000, '2020-01-15'),
(2, 'Bob', 20, 75000, '2021-03-20'),
(3, 'Carol', 10, 85000, '2022-06-10'),
(4, 'Dave', 30, 70000, '2023-09-01'),
(5, 'Eve', 10, 95000, '2019-11-05'),
(6, 'Frank', 20, 78000, '2024-02-14'),
(7, 'Grace', 30, 72000, '2023-07-22'),
(8, 'Hank', NULL, 65000, '2024-01-10');
Key observations:
- Employee 8 (Hank) has no department (
dept_idis NULL) - Department 40 (HR) has no employees
- Engineering has 3 employees, Marketing and Sales each have 2
Aggregate Functions
Aggregate functions compute a single value from a set of input rows:
| Function | Returns | NULL Handling |
|---|---|---|
COUNT(*) | Total number of rows | Counts all rows, including NULLs |
COUNT(col) | Number of non-NULL values in col | Skips NULLs |
SUM(col) | Sum of non-NULL values | Skips NULLs, returns NULL if no rows |
AVG(col) | Average of non-NULL values | Skips NULLs |
MIN(col) | Minimum non-NULL value | Skips NULLs |
MAX(col) | Maximum non-NULL value | Skips NULLs |
-- Aggregate across the entire table
SELECT
COUNT(*) AS total_employees,
COUNT(dept_id) AS with_department,
AVG(salary) AS avg_salary,
MIN(salary) AS min_salary,
MAX(salary) AS max_salary,
SUM(salary) AS total_payroll
FROM employees;
Result:
| total_employees | with_department | avg_salary | min_salary | max_salary | total_payroll |
|---|---|---|---|---|---|
| 8 | 7 | 78750.00 | 65000.00 | 95000.00 | 630000.00 |
COUNT(*) counts all rows. COUNT(column) counts only rows where column is not NULL. This distinction matters when your column contains NULLs — Hank has no department, so COUNT(dept_id) returns 7, not 8.
GROUP BY: Grouping Rows
GROUP BY splits rows into groups based on column values, then applies aggregate functions to each group independently.
SELECT
dept_id,
COUNT(*) AS employee_count,
AVG(salary) AS avg_salary,
SUM(salary) AS total_salary
FROM employees
GROUP BY dept_id;
Result:
| dept_id | employee_count | avg_salary | total_salary |
|---|---|---|---|
| 10 | 3 | 90000.00 | 270000.00 |
| 20 | 2 | 76500.00 | 153000.00 |
| 30 | 2 | 71000.00 | 142000.00 |
| NULL | 1 | 65000.00 | 65000.00 |
Rules of GROUP BY
- Every column in SELECT (that is not aggregated) must appear in GROUP BY:
-- Valid: dept_id is in GROUP BY
SELECT dept_id, AVG(salary)
FROM employees
GROUP BY dept_id;
-- INVALID in most databases: name is not in GROUP BY
-- (PostgreSQL/SQL Server will error; MySQL may allow it with unpredictable results)
SELECT dept_id, name, AVG(salary)
FROM employees
GROUP BY dept_id;
-
NULL forms its own group — all rows with NULL in the grouping column are grouped together.
-
You can group by multiple columns:
SELECT dept_id, hire_date
FROM employees
GROUP BY dept_id, hire_date;
GROUP BY with JOIN
Combine grouping with joins to produce reports across related tables:
SELECT
d.dept_name,
COUNT(*) AS employee_count,
AVG(e.salary) AS avg_salary
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
GROUP BY d.dept_name;
Result:
| dept_name | employee_count | avg_salary |
|---|---|---|
| Engineering | 3 | 90000.00 |
| Marketing | 2 | 76500.00 |
| Sales | 2 | 71000.00 |
Notice that Hank (no department) and HR (no employees) are absent. INNER JOIN excludes both. Use LEFT JOIN if you want to include all employees or all departments.
WHERE vs HAVING
This is the most common interview question about aggregation. The key insight is the order of execution:
| Clause | Filters | When it runs | Can use aggregates? |
|---|---|---|---|
| WHERE | Individual rows | Before grouping | No |
| HAVING | Groups (after aggregation) | After grouping | Yes |
WHERE: Filter Before Grouping
-- "Average salary per department for employees hired after 2022"
SELECT
dept_id,
COUNT(*) AS employee_count,
AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= '2023-01-01'
GROUP BY dept_id;
Result:
| dept_id | employee_count | avg_salary |
|---|---|---|
| 30 | 2 | 71000.00 |
| NULL | 1 | 65000.00 |
Only employees hired after 2022 are included before grouping happens. Engineering (Alice, Carol, Eve all hired before 2023) disappears entirely.
HAVING: Filter After Grouping
-- "Departments with more than 1 employee"
SELECT
dept_id,
COUNT(*) AS employee_count,
AVG(salary) AS avg_salary
FROM employees
GROUP BY dept_id
HAVING COUNT(*) > 1;
Result:
| dept_id | employee_count | avg_salary |
|---|---|---|
| 10 | 3 | 90000.00 |
| 20 | 2 | 76500.00 |
| 30 | 2 | 71000.00 |
All rows are grouped first, then groups with COUNT(*) <= 1 (Hank's NULL group) are removed.
Combining WHERE and HAVING
-- "Departments where employees hired after 2020 have an average salary over 72000"
SELECT
dept_id,
COUNT(*) AS employee_count,
AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= '2020-01-01' -- filter rows first
GROUP BY dept_id -- then group
HAVING AVG(salary) > 72000; -- then filter groups
Result:
| dept_id | employee_count | avg_salary |
|---|---|---|
| 10 | 3 | 90000.00 |
| 20 | 2 | 76500.00 |
Execution breakdown:
- FROM — all 8 employees
- WHERE — remove Eve (hired 2019-11-05) → 7 rows remain
- GROUP BY — group the 7 rows by
dept_id - HAVING — keep only groups where
AVG(salary) > 72000 - SELECT — output the final columns
DISTINCT vs GROUP BY
Both can produce unique values, but they serve different purposes:
-- Get unique dept_ids (both produce the same result)
SELECT DISTINCT dept_id FROM employees;
SELECT dept_id FROM employees GROUP BY dept_id;
Result (both queries):
| dept_id |
|---|
| 10 |
| 20 |
| 30 |
| NULL |
When they differ
| Aspect | DISTINCT | GROUP BY |
|---|---|---|
| Purpose | Remove duplicate rows | Aggregate data per group |
| Aggregate functions | Cannot use | Can use (COUNT, SUM, AVG, etc.) |
| Multiple columns | Deduplicates across all listed columns | Groups by all listed columns |
| Performance | Often similar | Often similar (same execution plan) |
| Filtering | No group-level filtering | Supports HAVING |
Use DISTINCT when:
- You only need unique values without any aggregation
-- "Which departments have employees?"
SELECT DISTINCT dept_id FROM employees;
Use GROUP BY when:
- You need to aggregate data per group
-- "How many employees per department?"
SELECT dept_id, COUNT(*) FROM employees GROUP BY dept_id;
- You need to filter groups with HAVING
-- "Departments with more than 2 employees"
SELECT dept_id FROM employees GROUP BY dept_id HAVING COUNT(*) > 2;
Under the hood, most databases produce the same execution plan for SELECT DISTINCT col and SELECT col GROUP BY col. Use whichever communicates your intent more clearly.
SQL Clause Execution Order
Understanding the logical order of SQL clause execution helps you write correct queries:
| Order | Clause | Purpose | Can use aliases? | Can use aggregates? |
|---|---|---|---|---|
| 1 | FROM | Identify source tables | No | No |
| 2 | JOIN | Combine tables | No | No |
| 3 | WHERE | Filter individual rows | No | No |
| 4 | GROUP BY | Group rows | No (in most DBs) | No |
| 5 | HAVING | Filter groups | No | Yes |
| 6 | SELECT | Compute output columns | N/A | Yes |
| 7 | ORDER BY | Sort results | Yes | Yes |
| 8 | LIMIT | Restrict row count | No | No |
Aliases defined in SELECT are not available in WHERE or HAVING — because SELECT runs after WHERE and HAVING. They are available in ORDER BY (which runs last).
Advanced Patterns
COUNT with conditionals
Count rows matching a condition within a group:
SELECT
dept_id,
COUNT(*) AS total,
COUNT(CASE WHEN salary > 80000 THEN 1 END) AS high_earners,
COUNT(CASE WHEN salary <= 80000 THEN 1 END) AS standard_earners
FROM employees
WHERE dept_id IS NOT NULL
GROUP BY dept_id;
Result:
| dept_id | total | high_earners | standard_earners |
|---|---|---|---|
| 10 | 3 | 3 | 0 |
| 20 | 2 | 1 | 1 |
| 30 | 2 | 0 | 2 |
This is often faster than writing multiple subqueries.
GROUP BY with multiple columns
SELECT
d.dept_name,
EXTRACT(YEAR FROM e.hire_date) AS hire_year,
COUNT(*) AS hires_that_year
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
GROUP BY d.dept_name, EXTRACT(YEAR FROM e.hire_date)
ORDER BY d.dept_name, hire_year;
Result:
| dept_name | hire_year | hires_that_year |
|---|---|---|
| Engineering | 2019 | 1 |
| Engineering | 2020 | 1 |
| Engineering | 2022 | 1 |
| Marketing | 2021 | 1 |
| Marketing | 2024 | 1 |
| Sales | 2023 | 2 |
Aggregates with NULLs
SELECT
COUNT(*) AS count_all,
COUNT(dept_id) AS count_dept,
SUM(salary) AS sum_salary,
AVG(salary) AS avg_salary
FROM employees;
Result:
| count_all | count_dept | sum_salary | avg_salary |
|---|---|---|---|
| 8 | 7 | 630000.00 | 78750.00 |
COUNT(*)= 8 (counts all rows)COUNT(dept_id)= 7 (skips Hank's NULL)AVG(salary)= 78750.00 (average of all 8 non-NULL salaries)
AVG ignores NULL values, not treating them as zero. If you want NULLs counted as zero, use COALESCE:
-- AVG treats NULLs as "not exist"
-- COALESCE forces NULLs to be 0, changing the average
SELECT AVG(COALESCE(bonus, 0)) FROM employees;
Common Pitfalls
- Using WHERE with aggregate functions —
WHERE COUNT(*) > 1is invalid. UseHAVING COUNT(*) > 1instead.
-- WRONG
SELECT dept_id, COUNT(*)
FROM employees
WHERE COUNT(*) > 1
GROUP BY dept_id;
-- CORRECT
SELECT dept_id, COUNT(*)
FROM employees
GROUP BY dept_id
HAVING COUNT(*) > 1;
- SELECT column not in GROUP BY — Every non-aggregated column in SELECT must be in GROUP BY. Most databases will error.
-- WRONG: name is not aggregated and not in GROUP BY
SELECT dept_id, name, AVG(salary)
FROM employees
GROUP BY dept_id;
-- CORRECT: either aggregate name or add to GROUP BY
SELECT dept_id, MAX(name) AS representative_name, AVG(salary)
FROM employees
GROUP BY dept_id;
-
HAVING without GROUP BY — Using
HAVINGwithoutGROUP BYtreats the entire table as one group. This is valid but confusing — preferWHERE. -
Filtering in the wrong clause — Using HAVING when WHERE suffices. HAVING is evaluated after grouping, so it processes more data than necessary.
-- INEFFICIENT: HAVING filters after grouping all rows
SELECT dept_id, COUNT(*)
FROM employees
HAVING dept_id IS NOT NULL
GROUP BY dept_id;
-- BETTER: WHERE filters before grouping (fewer rows to group)
SELECT dept_id, COUNT(*)
FROM employees
WHERE dept_id IS NOT NULL
GROUP BY dept_id;
- COUNT(*) vs COUNT(col) confusion —
COUNT(*)counts all rows;COUNT(col)skips NULLs. Know which one you need.
Best Practices
- Filter early with WHERE — reduce the number of rows before grouping for better performance
- Use HAVING only for aggregate conditions — if the condition doesn't use an aggregate, it belongs in WHERE
- Always GROUP BY all non-aggregated columns — don't rely on loose MySQL behavior
- Use meaningful aliases —
COUNT(*) AS employee_countis clearer than bareCOUNT(*) - Index columns used in GROUP BY — grouping often involves sorting or hashing
- Check execution plans — use
EXPLAINto verify the database is using indexes and not scanning the entire table - Consider materialized views — for expensive aggregation queries run frequently
Interview Questions
1. What is the difference between WHERE and HAVING?
| WHERE | HAVING | |
|---|---|---|
| Filters | Individual rows | Groups (after aggregation) |
| Runs | Before GROUP BY | After GROUP BY |
| Can use aggregates? | No | Yes |
| Use case | "Filter employees with salary > 80000" | "Filter departments with avg salary > 80000" |
-- WHERE: filter rows before grouping
SELECT dept_id, AVG(salary)
FROM employees
WHERE salary > 70000
GROUP BY dept_id;
-- HAVING: filter groups after grouping
SELECT dept_id, AVG(salary)
FROM employees
GROUP BY dept_id
HAVING AVG(salary) > 70000;
The first query excludes individual employees earning 70000 or less, then averages the rest. The second query averages all employees per department, then keeps departments where that average exceeds 70000.
2. What is the difference between COUNT(*) and COUNT(column)?
COUNT(*)counts all rows, including those with NULL valuesCOUNT(column)counts only rows wherecolumnis **not NULL`
SELECT
COUNT(*) AS total_rows, -- 8
COUNT(dept_id) AS with_dept -- 7 (Hank has NULL dept_id)
FROM employees;