Group By
The Philosophy of Aggregation
Section titled “The Philosophy of Aggregation”The purpose of aggregation is to move from a granular, row-level view of data to a high-level summary. Think of the difference between a raw list of every sale transaction for a day and the final “Total Sales for August 25th” number. Aggregation is the process of creating that summary.
The cornerstone of all aggregation is the GROUP BY clause.
1. The Cornerstone: GROUP BY
Section titled “1. The Cornerstone: GROUP BY”-
Core Concept: The
GROUP BYclause takes all the rows passed to it by theWHEREclause and organizes them into “buckets.” Each bucket contains rows that have the same value for the column(s) specified in theGROUP BYclause. -
The Rule: Once you use
GROUP BY, yourSELECTstatement can now only contain two things:- The columns you used to group by (the “buckets”).
- Aggregate functions that calculate a value based on the rows inside each bucket.
-
Example: Let’s conceptually group our
Productstable.... FROM Products GROUP BY CategoryID;This creates a set of buckets: one forCategoryID = 1(Electronics), one forCategoryID = 2(Books), etc. TheProductsrows are placed into their respective buckets.Bucket: CategoryID 1 (Electronics)- Row: Laptop- Row: Wireless Mouse- Row: MonitorBucket: CategoryID 2 (Books)- Row: SQL Mastery- Row: The Phoenix ProjectNow, we can ask questions about each bucket.
2. The Essential Aggregate Functions
Section titled “2. The Essential Aggregate Functions”These functions operate on the collection of rows within each GROUP BY bucket.
| Function | Definition | Interview-Level Insight & Example |
|---|---|---|
COUNT() | Counts the number of rows. | This is the most nuanced and important aggregate function. Your understanding of its three forms is a key differentiator. 1. COUNT(*): Counts all rows within the group, including NULLs and duplicates. This is the most common form for answering “How many items are in this group?”-- Get the number of products in each categorySELECT CategoryID, COUNT(*) AS NumberOfProducts FROM Products GROUP BY CategoryID;2. COUNT(column_name): Counts all rows where column_name is NOT NULL. It ignores NULLs.-- Count how many customers have provided a phone numberSELECT COUNT(PhoneNumber) FROM Customers;3. COUNT(DISTINCT column_name): Counts the number of unique, non-NULL values in the column.-- Count how many unique customers placed orders in a given monthSELECT COUNT(DISTINCT CustomerID) FROM Orders WHERE OrderDate BETWEEN '...'; |
SUM() | Calculates the sum of a numeric column. | Ignores NULL values completely. Crucial for financial reporting.-- Calculate the total dollar amount of sales for each daySELECT DATE(OrderDate), SUM(PriceAtPurchase * Quantity) AS DailySales<br>FROM Order_Items i JOIN Orders o ON i.OrderID = o.OrderID<br>GROUP BY DATE(OrderDate); |
AVG() | Calculates the average of a numeric column. | Also ignores NULL values. This can be a “gotcha.” If a row has a NULL score, it is not treated as a zero; it is excluded entirely from the calculation. -- Find the average price of products in each categorySELECT CategoryID, AVG(Price) AS AveragePrice FROM Products GROUP BY CategoryID; |
MIN() & MAX() | Finds the minimum/maximum value in a column. | Works on numeric, string, and date types. NULLs are ignored.-- Find the date of the earliest and latest order for each customerSELECT CustomerID, MIN(OrderDate) AS FirstOrder, MAX(OrderDate) AS LastOrder<br>FROM Orders GROUP BY CustomerID; |
3. WHERE vs. HAVING: The Great Filter Divide
Section titled “3. WHERE vs. HAVING: The Great Filter Divide”This is one of the most classic and important SQL interview questions. A candidate’s ability to articulate the difference clearly demonstrates their understanding of the query execution pipeline.
- The Golden Rule:
WHEREfilters rows before they are grouped.HAVINGfilters groups after the aggregate functions have been calculated.
The Logical Query Processing Order (Simplified):
FROM -> WHERE -> GROUP BY -> Aggregate Functions -> HAVING -> SELECT
Interview Scenario: The Filtering Test
Interviewer: “I need a list of all product categories that have more than 10 products in them. Which clause would you use to enforce the ‘more than 10 products’ condition, and why?”
Candidate 1 (Incorrect, uses WHERE):
-- THIS QUERY WILL FAILSELECT CategoryID, COUNT(*)FROM ProductsWHERE COUNT(*) > 10 -- You cannot use an aggregate function in WHEREGROUP BY CategoryID;- Why it Fails: The
WHEREclause is processed beforeGROUP BY. At the momentWHEREis filtering individual rows, theCOUNT(*)for a group has not been calculated yet. It doesn’t exist. This will result in an “Invalid use of group function” error.
Candidate 2 (Correct, uses HAVING):
SELECT CategoryID, COUNT(*) AS NumberOfProductsFROM ProductsGROUP BY CategoryIDHAVING NumberOfProducts > 10;Your Expert Explanation:
“The correct clause to use here is HAVING.
- First,
FROM Productsgets all the products. - Then,
GROUP BY CategoryIDputs them into their respective category buckets. - For each bucket, the aggregate function
COUNT(*)is calculated. At this point, we have a temporary result set that looks something like:(CategoryID=1, Count=25),(CategoryID=2, Count=8),(CategoryID=3, Count=15). - Finally, the
HAVINGclause is applied to this new, grouped result set. It checks our calculated aliasNumberOfProducts > 10. The group for Category 2 (with a count of 8) is discarded. - The final result contains only the groups that passed the
HAVINGcondition.
You must use HAVING because it operates on the result of the aggregation, while WHERE operates on the raw data before aggregation.”
Putting It All Together: Advanced Interview Problems
Section titled “Putting It All Together: Advanced Interview Problems”Problem 1: Top-Spending Customers
Interviewer: “Write a query to find the CustomerID and total amount spent for the top 5 customers who have spent more than $1,000 in total.”
Your Expert Answer: “This requires a multi-step aggregation: joining tables, filtering, grouping, filtering the groups, ordering, and finally limiting the result.”
-- (Step 6) Select the final columnsSELECT o.CustomerID, c.FirstName, c.LastName, SUM(i.PriceAtPurchase * i.Quantity) AS TotalSpent-- (Step 1) Start with orders and join to items and customersFROM Orders AS oJOIN Order_Items AS i ON o.OrderID = i.OrderIDJOIN Customers AS c ON o.CustomerID = c.CustomerID-- (Step 2) There is no pre-filtering of rows, so no WHERE clause-- (Step 3) Group by customer to create a bucket for each oneGROUP BY o.CustomerID, c.FirstName, c.LastName-- (Step 4) Filter the GROUPS, keeping only those who spent over $1000HAVING TotalSpent > 1000.00-- (Step 5) Order the remaining groups by the total spent to find the top spendersORDER BY TotalSpent DESC-- (Step 7) Take only the top 5 from the final, sorted listLIMIT 5;Problem 2: Find Products That Have Never Been Sold
Interviewer: “Write a query to find all products that have never appeared in an Order_Items record.”
Your Expert Answer:
“This is a great question that tests relational logic. The most efficient way to solve this is with a LEFT JOIN and a WHERE clause that looks for NULLs.”
SELECT p.ProductID, p.ProductNameFROM Products AS pLEFT JOIN Order_Items AS oi ON p.ProductID = oi.ProductIDWHERE oi.ProductID IS NULL;- Why it Works: A
LEFT JOINwill include all rows from the left table (Products), regardless of whether they have a match in the right table (Order_Items). If a product has never been sold, the columns from theOrder_Itemsside of the join (oi.ProductID, etc.) will beNULL. TheWHERE oi.ProductID IS NULLclause then simply filters the result to show only these unmatched products. This is often more performant than using a subquery withNOT IN.