Skip to main content

SQL Aggregation & Filtering

What You'll Learn

By the end of this page, you will be able to:

  • Use aggregate functions (COUNT, SUM, AVG, MIN, MAX) to summarize data
  • Write GROUP BY queries to produce subtotals per category
  • Explain the critical difference between WHERE and HAVING
  • Use DISTINCT and understand when it differs from GROUP BY
  • Write queries with multiple grouping levels and filtered aggregations
  • Avoid common pitfalls with NULLs, empty groups, and incorrect filtering

The Problem: Summarizing Data

Raw data lives in rows — one row per event, per order, per employee. But business questions ask about groups, not individual rows:

  • "What is the average salary per department?"
  • "How many orders did each customer place last month?"
  • "Which departments have more than 5 employees?"

You need a way to collapse many rows into summary rows. That's what aggregation does.

Sample Tables

All examples use these tables:

CREATE TABLE departments (
dept_id INT PRIMARY KEY,
dept_name VARCHAR(50)
);

INSERT INTO departments VALUES
(10, 'Engineering'),
(20, 'Marketing'),
(30, 'Sales'),
(40, 'HR');

CREATE TABLE employees (
id INT PRIMARY KEY,
name VARCHAR(50),
dept_id INT REFERENCES departments(dept_id),
salary DECIMAL(10,2),
hire_date DATE
);

INSERT INTO employees VALUES
(1, 'Alice', 10, 90000, '2020-01-15'),
(2, 'Bob', 20, 75000, '2021-03-20'),
(3, 'Carol', 10, 85000, '2022-06-10'),
(4, 'Dave', 30, 70000, '2023-09-01'),
(5, 'Eve', 10, 95000, '2019-11-05'),
(6, 'Frank', 20, 78000, '2024-02-14'),
(7, 'Grace', 30, 72000, '2023-07-22'),
(8, 'Hank', NULL, 65000, '2024-01-10');

Key observations:

  • Employee 8 (Hank) has no department (dept_id is NULL)
  • Department 40 (HR) has no employees
  • Engineering has 3 employees, Marketing and Sales each have 2

Aggregate Functions

Aggregate functions compute a single value from a set of input rows:

FunctionReturnsNULL Handling
COUNT(*)Total number of rowsCounts all rows, including NULLs
COUNT(col)Number of non-NULL values in colSkips NULLs
SUM(col)Sum of non-NULL valuesSkips NULLs, returns NULL if no rows
AVG(col)Average of non-NULL valuesSkips NULLs
MIN(col)Minimum non-NULL valueSkips NULLs
MAX(col)Maximum non-NULL valueSkips NULLs
-- Aggregate across the entire table
SELECT
COUNT(*) AS total_employees,
COUNT(dept_id) AS with_department,
AVG(salary) AS avg_salary,
MIN(salary) AS min_salary,
MAX(salary) AS max_salary,
SUM(salary) AS total_payroll
FROM employees;

Result:

total_employeeswith_departmentavg_salarymin_salarymax_salarytotal_payroll
8778750.0065000.0095000.00630000.00
tip

COUNT(*) counts all rows. COUNT(column) counts only rows where column is not NULL. This distinction matters when your column contains NULLs — Hank has no department, so COUNT(dept_id) returns 7, not 8.

GROUP BY: Grouping Rows

GROUP BY splits rows into groups based on column values, then applies aggregate functions to each group independently.

SELECT
dept_id,
COUNT(*) AS employee_count,
AVG(salary) AS avg_salary,
SUM(salary) AS total_salary
FROM employees
GROUP BY dept_id;

Result:

dept_idemployee_countavg_salarytotal_salary
10390000.00270000.00
20276500.00153000.00
30271000.00142000.00
NULL165000.0065000.00

Rules of GROUP BY

  1. Every column in SELECT (that is not aggregated) must appear in GROUP BY:
-- Valid: dept_id is in GROUP BY
SELECT dept_id, AVG(salary)
FROM employees
GROUP BY dept_id;

-- INVALID in most databases: name is not in GROUP BY
-- (PostgreSQL/SQL Server will error; MySQL may allow it with unpredictable results)
SELECT dept_id, name, AVG(salary)
FROM employees
GROUP BY dept_id;
  1. NULL forms its own group — all rows with NULL in the grouping column are grouped together.

  2. You can group by multiple columns:

SELECT dept_id, hire_date
FROM employees
GROUP BY dept_id, hire_date;

GROUP BY with JOIN

Combine grouping with joins to produce reports across related tables:

SELECT
d.dept_name,
COUNT(*) AS employee_count,
AVG(e.salary) AS avg_salary
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
GROUP BY d.dept_name;

Result:

dept_nameemployee_countavg_salary
Engineering390000.00
Marketing276500.00
Sales271000.00
tip

Notice that Hank (no department) and HR (no employees) are absent. INNER JOIN excludes both. Use LEFT JOIN if you want to include all employees or all departments.

WHERE vs HAVING

This is the most common interview question about aggregation. The key insight is the order of execution:

ClauseFiltersWhen it runsCan use aggregates?
WHEREIndividual rowsBefore groupingNo
HAVINGGroups (after aggregation)After groupingYes

WHERE: Filter Before Grouping

-- "Average salary per department for employees hired after 2022"
SELECT
dept_id,
COUNT(*) AS employee_count,
AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= '2023-01-01'
GROUP BY dept_id;

Result:

dept_idemployee_countavg_salary
30271000.00
NULL165000.00

Only employees hired after 2022 are included before grouping happens. Engineering (Alice, Carol, Eve all hired before 2023) disappears entirely.

HAVING: Filter After Grouping

-- "Departments with more than 1 employee"
SELECT
dept_id,
COUNT(*) AS employee_count,
AVG(salary) AS avg_salary
FROM employees
GROUP BY dept_id
HAVING COUNT(*) > 1;

Result:

dept_idemployee_countavg_salary
10390000.00
20276500.00
30271000.00

All rows are grouped first, then groups with COUNT(*) <= 1 (Hank's NULL group) are removed.

Combining WHERE and HAVING

-- "Departments where employees hired after 2020 have an average salary over 72000"
SELECT
dept_id,
COUNT(*) AS employee_count,
AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= '2020-01-01' -- filter rows first
GROUP BY dept_id -- then group
HAVING AVG(salary) > 72000; -- then filter groups

Result:

dept_idemployee_countavg_salary
10390000.00
20276500.00

Execution breakdown:

  1. FROM — all 8 employees
  2. WHERE — remove Eve (hired 2019-11-05) → 7 rows remain
  3. GROUP BY — group the 7 rows by dept_id
  4. HAVING — keep only groups where AVG(salary) > 72000
  5. SELECT — output the final columns

DISTINCT vs GROUP BY

Both can produce unique values, but they serve different purposes:

-- Get unique dept_ids (both produce the same result)
SELECT DISTINCT dept_id FROM employees;
SELECT dept_id FROM employees GROUP BY dept_id;

Result (both queries):

dept_id
10
20
30
NULL

When they differ

AspectDISTINCTGROUP BY
PurposeRemove duplicate rowsAggregate data per group
Aggregate functionsCannot useCan use (COUNT, SUM, AVG, etc.)
Multiple columnsDeduplicates across all listed columnsGroups by all listed columns
PerformanceOften similarOften similar (same execution plan)
FilteringNo group-level filteringSupports HAVING

Use DISTINCT when:

  • You only need unique values without any aggregation
-- "Which departments have employees?"
SELECT DISTINCT dept_id FROM employees;

Use GROUP BY when:

  • You need to aggregate data per group
-- "How many employees per department?"
SELECT dept_id, COUNT(*) FROM employees GROUP BY dept_id;
  • You need to filter groups with HAVING
-- "Departments with more than 2 employees"
SELECT dept_id FROM employees GROUP BY dept_id HAVING COUNT(*) > 2;
tip

Under the hood, most databases produce the same execution plan for SELECT DISTINCT col and SELECT col GROUP BY col. Use whichever communicates your intent more clearly.

SQL Clause Execution Order

Understanding the logical order of SQL clause execution helps you write correct queries:

OrderClausePurposeCan use aliases?Can use aggregates?
1FROMIdentify source tablesNoNo
2JOINCombine tablesNoNo
3WHEREFilter individual rowsNoNo
4GROUP BYGroup rowsNo (in most DBs)No
5HAVINGFilter groupsNoYes
6SELECTCompute output columnsN/AYes
7ORDER BYSort resultsYesYes
8LIMITRestrict row countNoNo
warning

Aliases defined in SELECT are not available in WHERE or HAVING — because SELECT runs after WHERE and HAVING. They are available in ORDER BY (which runs last).

Advanced Patterns

COUNT with conditionals

Count rows matching a condition within a group:

SELECT
dept_id,
COUNT(*) AS total,
COUNT(CASE WHEN salary > 80000 THEN 1 END) AS high_earners,
COUNT(CASE WHEN salary <= 80000 THEN 1 END) AS standard_earners
FROM employees
WHERE dept_id IS NOT NULL
GROUP BY dept_id;

Result:

dept_idtotalhigh_earnersstandard_earners
10330
20211
30202

This is often faster than writing multiple subqueries.

GROUP BY with multiple columns

SELECT
d.dept_name,
EXTRACT(YEAR FROM e.hire_date) AS hire_year,
COUNT(*) AS hires_that_year
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
GROUP BY d.dept_name, EXTRACT(YEAR FROM e.hire_date)
ORDER BY d.dept_name, hire_year;

Result:

dept_namehire_yearhires_that_year
Engineering20191
Engineering20201
Engineering20221
Marketing20211
Marketing20241
Sales20232

Aggregates with NULLs

SELECT
COUNT(*) AS count_all,
COUNT(dept_id) AS count_dept,
SUM(salary) AS sum_salary,
AVG(salary) AS avg_salary
FROM employees;

Result:

count_allcount_deptsum_salaryavg_salary
87630000.0078750.00
  • COUNT(*) = 8 (counts all rows)
  • COUNT(dept_id) = 7 (skips Hank's NULL)
  • AVG(salary) = 78750.00 (average of all 8 non-NULL salaries)
warning

AVG ignores NULL values, not treating them as zero. If you want NULLs counted as zero, use COALESCE:

-- AVG treats NULLs as "not exist"
-- COALESCE forces NULLs to be 0, changing the average
SELECT AVG(COALESCE(bonus, 0)) FROM employees;

Common Pitfalls

  • Using WHERE with aggregate functionsWHERE COUNT(*) > 1 is invalid. Use HAVING COUNT(*) > 1 instead.
-- WRONG
SELECT dept_id, COUNT(*)
FROM employees
WHERE COUNT(*) > 1
GROUP BY dept_id;

-- CORRECT
SELECT dept_id, COUNT(*)
FROM employees
GROUP BY dept_id
HAVING COUNT(*) > 1;
  • SELECT column not in GROUP BY — Every non-aggregated column in SELECT must be in GROUP BY. Most databases will error.
-- WRONG: name is not aggregated and not in GROUP BY
SELECT dept_id, name, AVG(salary)
FROM employees
GROUP BY dept_id;

-- CORRECT: either aggregate name or add to GROUP BY
SELECT dept_id, MAX(name) AS representative_name, AVG(salary)
FROM employees
GROUP BY dept_id;
  • HAVING without GROUP BY — Using HAVING without GROUP BY treats the entire table as one group. This is valid but confusing — prefer WHERE.

  • Filtering in the wrong clause — Using HAVING when WHERE suffices. HAVING is evaluated after grouping, so it processes more data than necessary.

-- INEFFICIENT: HAVING filters after grouping all rows
SELECT dept_id, COUNT(*)
FROM employees
HAVING dept_id IS NOT NULL
GROUP BY dept_id;

-- BETTER: WHERE filters before grouping (fewer rows to group)
SELECT dept_id, COUNT(*)
FROM employees
WHERE dept_id IS NOT NULL
GROUP BY dept_id;
  • COUNT(*) vs COUNT(col) confusionCOUNT(*) counts all rows; COUNT(col) skips NULLs. Know which one you need.

Best Practices

  1. Filter early with WHERE — reduce the number of rows before grouping for better performance
  2. Use HAVING only for aggregate conditions — if the condition doesn't use an aggregate, it belongs in WHERE
  3. Always GROUP BY all non-aggregated columns — don't rely on loose MySQL behavior
  4. Use meaningful aliasesCOUNT(*) AS employee_count is clearer than bare COUNT(*)
  5. Index columns used in GROUP BY — grouping often involves sorting or hashing
  6. Check execution plans — use EXPLAIN to verify the database is using indexes and not scanning the entire table
  7. Consider materialized views — for expensive aggregation queries run frequently

Interview Questions

1. What is the difference between WHERE and HAVING?

WHEREHAVING
FiltersIndividual rowsGroups (after aggregation)
RunsBefore GROUP BYAfter GROUP BY
Can use aggregates?NoYes
Use case"Filter employees with salary > 80000""Filter departments with avg salary > 80000"
-- WHERE: filter rows before grouping
SELECT dept_id, AVG(salary)
FROM employees
WHERE salary > 70000
GROUP BY dept_id;

-- HAVING: filter groups after grouping
SELECT dept_id, AVG(salary)
FROM employees
GROUP BY dept_id
HAVING AVG(salary) > 70000;

The first query excludes individual employees earning 70000 or less, then averages the rest. The second query averages all employees per department, then keeps departments where that average exceeds 70000.

2. What is the difference between COUNT(*) and COUNT(column)?

  • COUNT(*) counts all rows, including those with NULL values
  • COUNT(column) counts only rows where column is **not NULL`
SELECT
COUNT(*) AS total_rows, -- 8
COUNT(dept_id) AS with_dept -- 7 (Hank has NULL dept_id)
FROM employees;

3. What is the difference between DISTINCT and GROUP BY?

Both can return unique values:

-- These produce identical results
SELECT DISTINCT dept_id FROM employees;
SELECT dept_id FROM employees GROUP BY dept_id;

But GROUP BY can do more — it supports aggregation and HAVING:

-- GROUP BY with aggregation (DISTINCT can't do this)
SELECT dept_id, COUNT(*), AVG(salary)
FROM employees
GROUP BY dept_id;

Use DISTINCT for simple deduplication. Use GROUP BY when you need aggregation or group-level filtering.

4. Write a query to find departments with more than 2 employees earning above 75000.

This requires both WHERE and HAVING:

SELECT
d.dept_name,
COUNT(*) AS high_earner_count,
AVG(e.salary) AS avg_high_salary
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
WHERE e.salary > 75000 -- filter individual rows first
GROUP BY d.dept_name -- then group by department
HAVING COUNT(*) > 2; -- then filter groups

5. What is the order of execution of SQL clauses?

FROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT

This explains why:

  • You can't use SELECT aliases in WHERE (WHERE runs first)
  • You can't use aggregate functions in WHERE (aggregation hasn't happened yet)
  • You can use SELECT aliases in ORDER BY (ORDER BY runs last)

6. What happens when you GROUP BY a column that contains NULLs?

All rows with NULL in the grouping column are placed into one group:

SELECT dept_id, COUNT(*)
FROM employees
GROUP BY dept_id;

This produces a row for dept_id = NULL with count 1 (Hank). NULL is treated as a distinct grouping value.

7. How do you count rows matching a condition within a group?

Use COUNT(CASE WHEN ... THEN 1 END) or the FILTER clause (PostgreSQL):

-- Standard SQL (works everywhere)
SELECT
dept_id,
COUNT(*) AS total,
COUNT(CASE WHEN salary > 80000 THEN 1 END) AS high_earners
FROM employees
GROUP BY dept_id;

-- PostgreSQL-specific (cleaner syntax)
SELECT
dept_id,
COUNT(*) AS total,
COUNT(*) FILTER (WHERE salary > 80000) AS high_earners
FROM employees
GROUP BY dept_id;

Learn More