Skip to content

Subquery

Mastering Subqueries: The Logic of Nested Questions

Section titled “Mastering Subqueries: The Logic of Nested Questions”

1. The “Why”: The Philosophy of Subqueries

Section titled “1. The “Why”: The Philosophy of Subqueries”

The philosophy of a subquery is “ask a question to help answer another question.”

Imagine you need to solve a problem that requires two pieces of information, and you can’t get them both in one go. For example: “Find all the products that are more expensive than our most expensive book.”

You can’t answer this in a single pass because you first need to determine the price of the “most expensive book.” A subquery allows you to perform this two-step logic within a single SQL statement.

  1. Inner Question: What is the maximum price of any book?
  2. Outer Question: Find all products whose price is greater than the answer to the inner question.

This is the essence of a subquery: the result of the inner query is passed to the outer query, which then uses that result to finish its work.

2. Types of Subqueries: Categorized by What They Return

Section titled “2. Types of Subqueries: Categorized by What They Return”

A subquery is just a SELECT statement, but its behavior and how you use it depend entirely on the shape of the data it returns.

A. Scalar Subquery: Returns a Single Value

This is a subquery that returns exactly one column and one row.

  • Analogy: Think of it as a query that returns a single piece of data, like a variable in a programming language.
  • Use Case: It is used anywhere a single value is expected, most commonly in the WHERE clause with a comparison operator or in the SELECT list.

Example (Solving the problem from the philosophy section):

-- Outer Query
SELECT
ProductName,
Price
FROM
Products
WHERE
Price > (
-- Inner (Scalar) Subquery
SELECT MAX(Price) FROM Products WHERE CategoryID = 2 -- Assuming Category 2 is 'Books'
);```
**How it works:**
1. The database executes the inner query first: `SELECT MAX(Price) FROM Products WHERE CategoryID = 2`. Let's say it returns a single value: `89.99`.
2. The database then substitutes this value into the outer query, which effectively becomes: `... WHERE Price > 89.99;`.
3. The outer query then executes as a simple filtered query.
---
**B. Multi-row Subquery: Returns a Single Column with Multiple Rows**
This is a subquery that returns **one column** but can return **multiple rows**.
* **Analogy:** Think of it as a query that returns a list or an array of values.
* **Use Case:** It is almost always used in a `WHERE` clause with multi-value operators like `IN`, `NOT IN`, `ANY`, or `ALL`.
**Example:** "Find the names of all customers who have placed at least one order." (This is the same problem we solved with `JOIN`, but now solved with a subquery).
```sql
-- Outer Query
SELECT CustomerID, FirstName, LastName
FROM Customers
WHERE CustomerID IN (
-- Inner (Multi-row) Subquery
SELECT DISTINCT CustomerID FROM Orders
);

How it works:

  1. The inner query runs first: SELECT DISTINCT CustomerID FROM Orders. It returns a list of customer IDs, e.g., (1, 2, 5, 8).
  2. The outer query then uses this list in its WHERE clause, effectively becoming: ... WHERE CustomerID IN (1, 2, 5, 8);.

C. Multi-column Subquery (Table Subquery): Returns Multiple Columns and Multiple Rows

This is a subquery that returns a full, table-like result set with multiple columns and multiple rows.

  • Analogy: Think of it as a temporary, virtual table created on the fly.
  • Use Case: It is most often used in the FROM or JOIN clause, where you treat the entire subquery’s result as if it were a real table. When used in the FROM clause, it is often called a derived table.

Example: “Find the average number of orders placed per customer.” This is a two-step aggregation: first, count the orders for each customer; second, find the average of those counts.

-- Outer Query
SELECT
AVG(OrderCount) AS AverageOrdersPerCustomer
FROM (
-- Inner (Table) Subquery
SELECT
CustomerID,
COUNT(OrderID) AS OrderCount
FROM Orders
GROUP BY CustomerID
) AS CustomerOrderCounts; -- The subquery MUST have an alias

How it works:

  1. The inner query runs first, producing a virtual table named CustomerOrderCounts that looks like this:
    CustomerIDOrderCount
    15
    212
    53
  2. The outer query then runs against this temporary, derived table, calculating the average of the OrderCount column.

3. Interview Gold: Correlated Subqueries (The Advanced Topic)

Section titled “3. Interview Gold: Correlated Subqueries (The Advanced Topic)”

This concept separates advanced SQL users from intermediates.

  • Standard Subquery (Uncorrelated): The inner query runs once, independently of the outer query. Its result is passed to the outer query, which then does its work. (All the examples above are uncorrelated).
  • Correlated Subquery: The inner query depends on the outer query for its values. The inner query runs once for every single row processed by the outer query. It is like a nested loop in programming.

The Problem: “For each product, find the number of orders it appears in.” (This can also be solved with a JOIN, but it’s a classic correlated subquery example).

-- Outer Query
SELECT
p.ProductID,
p.ProductName,
(
-- Inner (Correlated) Subquery
SELECT COUNT(*)
FROM Order_Items AS oi
WHERE oi.ProductID = p.ProductID -- The Correlation!
) AS OrderCount
FROM
Products AS p;

How it works:

  1. The outer query gets the first row from Products (e.g., p.ProductID = 1, ‘Laptop’).
  2. It then runs the entire inner query, substituting the value from the outer query: SELECT COUNT(*) FROM Order_Items WHERE ProductID = 1;. Let’s say this returns 50.
  3. The first row of the final result is (1, ‘Laptop’, 50).
  4. The outer query gets the second row from Products (e.g., p.ProductID = 2, ‘Mouse’).
  5. It then re-runs the entire inner query: SELECT COUNT(*) FROM Order_Items WHERE ProductID = 2;. Let’s say this returns 120.
  6. The second row of the final result is (2, ‘Mouse’, 120).
  7. …and so on, for every product in the Products table.

The Performance Trap: Because the inner query runs for every row of the outer query, correlated subqueries can be extremely slow on large tables. A JOIN-based solution is often much more performant because the database optimizer has more freedom to decide the best way to link the tables.

Interviewer: “You showed me three ways to find customers who have placed orders: an INNER JOIN, a subquery with IN, and now I’ll show you a third way with EXISTS. Can you explain the difference, and which you would choose?”

The EXISTS version:

SELECT c.CustomerID, c.FirstName
FROM Customers AS c
WHERE EXISTS (
SELECT 1
FROM Orders AS o
WHERE o.CustomerID = c.CustomerID
);

Your Expert Answer: “This is a fantastic question about query optimization strategies. All three queries can produce the correct result, but they tell the database to ‘think’ about the problem differently.

  1. INNER JOIN: The optimizer treats this as a holistic request to combine both tables and then filter. It will analyze the indexes on both tables and devise a plan to merge them, often by using an index on Orders.CustomerID to efficiently look up matches. This is generally the most performant and declarative option.

  2. Subquery with IN: This tells the database to first create a complete, de-duplicated list of all CustomerIDs from the Orders table, and then scan the Customers table to find matches for that list. Modern optimizers are very good at rewriting IN queries as JOINs, but historically, IN could be slow if the subquery returned a very large list of values.

  3. Correlated Subquery with EXISTS: This is the most different. EXISTS doesn’t care what data the subquery returns; it only cares if it returns at least one row. The subquery SELECT 1 ... is a common convention that means ‘just tell me if you find anything’. For each customer, it performs a search. The key benefit is that the database can stop searching as soon as it finds the first match. If a customer has 10,000 orders, EXISTS finds the first one and immediately returns TRUE and moves on. IN would have had to process all 10,000 orders to build its initial list.

Conclusion: For the ‘customers with orders’ problem, an INNER JOIN is usually the most readable and performant choice. However, EXISTS can be significantly faster than IN or even a JOIN in specific scenarios where you just need to check for the existence of any related record, especially if the number of related records is very large.”