[Follow our blog posts, obsession with data, and original articles on Twitter @RJMetrics]
As you may know, we use MySQL for both data storage and a significant portion of our analytical legwork here at RJMetrics. I’m normally quite pleased with the breadth of resources available for MySQL DBAs and Developers, particularly the online MySQL Reference Manual (which is rich with great comments from a really strong community). However, I recently came across a topic that seems to get brushed over in the MySQL Reference Manual, the forums, and even the MySQL Certification Manual. This topic is the correlated subquery. I thought I would share some of my findings about this little-known feature. While my examples are only tested in MySQL, I should note that the correlated subquery syntax is a part of standard SQL, so you should be able to use it in pretty much any relational database platform.
To provide some background, a subquery is a SELECT statement that is placed within parentheses inside another SQL statement. In most cases, you can easily replicate the effect of a subquery with a JOIN (or vice versa), but using subqueries can really help the logical flow of an SQL statement. Keywords like “IN” and “ANY” allow for statements that are quite logical to deconstruct and even understand when read aloud. As an example, look at this query that lists all of our customers with an address in New Jersey:
WHERE CustomerID = ANY
(SELECT CustomerID FROM AddressBook WHERE State = ‘NJ’);
The subquery here is the parenthetical statement in the WHERE clause: (SELECT CustomerID FROM AddressBook WHERE State = ‘NJ’)
This subquery is what we call “non-correlated” because you could run it all by itself and get a perfectly logical (and error-free) result. In this case, the isolated subquery would yield a list of the CustomerIDs of customers from NJ.
However, there is another breed of subquery known as the “correlated subquery” that contains references to values in the outer query and can not be evaluated independently of that outer query. Just about every resource I could find on the web for “MySQL correlated subquery” offers up an example query structured exactly like this:
SELECT * FROM t1 WHERE column1 = ANY
(SELECT column1 FROM t2 WHERE t2.column2 = t1.column2);
Notice how the table t1 is referenced in the WHERE clause of the subquery even though it it’s not named in the FROM clause of the subquery itself; it only exists in the outer query. If you were to execute just the isolated subquery in this case, you would receive an error.
This seems like it’s got the potential add some serious efficiency and elegance to certain queries. In fact, when you think about what this allows you to do, it’s a bit mind-boggling. If we let every subquery reference any tables in outer query, do we open up a Pandora’s Box of complexity? This was my first reaction, especially when I considered that subqueries can be included in the FROM clause. Wouldn’t that allow for a correlated subquery in the WHERE clause referencing another correlated subquery in the FROM clause that then references another correlated subquery right beside it? But, there’s the rub!
Unlike their non-correlated bretheren, correlated subqueries are NOT allowed in the FROM clause of a query. This key rule, which is for some reason not mentioned on the Correlated Subqueries page of the MySQL Reference Manual, significantly limits the scope of what a correlated subquery can do (and makes it much, much easier to understand).
So, if a correlated subquery can’t be used in the FROM clause of a query, where can it be used? As we saw in the example, it can be used in the WHERE clause (this is most common). Like its non-correlated sibling, it can also be used in the HAVING clause and the SELECT clause.
What do each of these allowable clauses have in common? It’s simple: they are used AFTER data is pulled to either eliminate selected rows or change which columns are displayed! In other words, if you took out the subquery entirely, you would still get some perfectly logical result (just missing the subquery logic).
For example, lets look at two clauses where a correlated subquery could exist and think about the impact they have on their queries. First, think about how the WHERE clause works in this statement:
WHERE age >= 5
Without the WHERE clause, the query would run just fine and return every dog’s name. However, the WHERE clause eliminates rows where age is less than 5.
Similarly, look at the expression “age * 7″ SELECT clause here:
SELECT Name, age * 7 AS HumanAge
WHERE age >= 5
Without “age * 7″ this query would run fine and return the same result set as our last query. However, adding an expression in the SELECT query causes it to add a column to the result set that contains a value specific to each row in the result.
What do these two examples have in common? Both the WHERE clause and the SELECT expression are run on a result set that already exists as a result of the rest of the query. Neither can add rows to the result set, they can only limit rows (WHERE) or add columns (SELECT). As such, without exception, the expressions that appear in these clauses are evaluated against every single record that would have been returned had they not existed (this is an oversimplification of how the query runs, especially if there is an index on age, but for our purposes it’s a perfectly good way of thinking about it).
So, now we ask the question: what happens when we replace our WHERE comparison or SELECT expression with a correlated subquery? Exactly the same thing: the subquery is executed for every single record that exists in the result set prior to its consideration. Therefore, for each record, you can reference the value in any column of any table in the outer FROM clause. Take a look at this correlated subquery (dogs.Name is assumed contain only unique values):
FROM dogs d
WHERE (SELECT MAX(HaircutDate)
FROM haircuts h
WHERE h.Name = d.Name) < ’2008-09-01′
This query gives us the name of each dog that hasn’t had a haircut since September 2008. When this query is run, the subquery in the WHERE clause is executed for EVERY single row in the result set that would have existed if the subquery wasn’t there. In other words, if there are 20 dogs in the dogs table, MySQL runs the subquery 20 times– each time replacing the reference d.Name with the name of the dog in the given row.
Also, notice that we could use this subquery in the SELECT clause to tell us the most recent haircut date for each dog:
SELECT Name, (SELECT MAX(HaircutDate)
FROM haircuts h
WHERE h.Name = d.Name) AS LastHaircut
FROM dogs d
Like all subqueries on WHERE and SELECT clauses, correlated subqueries are designed to return to return a scalar value (not a table of results), or in some cases single row of values (which requires special syntax I won’t get into here). Subquery-related keywords such as IN, ANY, SOME, and EXISTS only return TRUE or FALSE, making them ideal for use in WHERE clause subqueries.
You can also use correlated subqueries in the WHERE clause of UPDATE and DELETE statements to narrow down which rows are affected by the statement.
As a parting note, I urge you to be careful when putting correlated subqueries to use. As I mentioned, you can always use a JOIN to achieve a similar effect, and in some cases this may allow for faster query execution (you can use the EXPLAIN keyword to do testing). Also, one of the main reasons I like subqueries in general is that they can be extracted from their outer queries and understood on their own, which makes it easier to figure out what’s going in a complex query. This won’t be the case with the correlated subquery.
Anyway, keep this one handy in your MySQL toolbox—you may be putting it to work sooner than you think!