This post is motivated by reasons very similar to the ones that motivated my React and Redux “tutorial”. Again, it should be more accurately but less informatively titled “How I wish SQL SELECTs were explained to me”. Again, it does not imply that this method of explanation is suitable for anybody else. One difference is that this time, I mostly only wanted to learn about SQL SELECTs to the extent it would help me perform and optimize queries in Django’s ORM, but to prevent this post from languishing forever in my drafts folder, that material has been sectioned off into a possible future post, because I figured out what I wanted, ran out of steam, and am now trying to learn TLA⁺. Just me things.
The SQL standard is confusing and almost never completely implemented; there are huge inconsistencies between SQL implementations. I will focus on SQLite because it’s popular and easy to play with, but generally try to stay away from unpopular or nonstandard features. SQLite’s SELECT documentation is good reading for one particular SQL implementation.
A SQL database is a place where you store and query a bunch of data that’s organized into tables. A table is a homogeneous list of rows. A row is a heterogeneous tuple of values of various simple data types. The data types supported depend on the SQL implementation; typical examples are integers and strings of various sizes, floating point numbers, and dates/datetimes. All of these types can be nullable; NULL is a SQL value that can appear just about anywhere. (Like many of the other SQL features, NULL is handled somewhat inconsistently across SQL implementations, but as a first-order approximation it’s closer to a floating-point NaN than, say, Java’s “null”. We’ll talk more about it later.) However, note that you can’t have a variable-size list of other things in a row. And just to make sure it’s clear, all the rows in a given table must have the same data types in the same order.
A “column” is just what you’d intuitively expect it to be: it’s the homogeneous list of all values in a particular position in each row of a table, which all have the same data type. One thing I haven’t mentioned yet is that table columns all have names. This is true both for tables stored in the database and for the ephemeral tables that are the output of queries.
Since I’ll also be referring to more complex types like lists and tuples often seen in conventional programming languages, I’ll call these simple data types “scalar types” and values of those types “scalars”. This is not SQL terminology; documentation usually just calls these “data types”. Here’s SQLite’s page on data types.
To play along, install SQLite and run it. You should get dropped into a connection to an ephemeral in-memory database, which is plenty enough for our purposes. Make a table and mutter some magic incantations to make the output a little prettier for us:
(The first two commands are SQL and probably portable. The last two “dot-commands” are extremely SQLite-specific.)
Note: SQL is not case-sensitive. In this post, I’ll type all the keywords in uppercase for clarity and consistency with more documentation, but if I were you and just experimenting, I would type
create table a (id int); and so on.
SELECT statements are the main way one gets information out of a database. For the purposes of this post, a
SELECT statement looks like this:
FROM are required; everything else is optional. There are several features I have omitted and am not talking about, to keep things simple and because I have never used them. There are operators you can use to combine multiple result tables that have the exact same column types:
EXCEPT. There are also window functions, which are sort of spiritually related to
GROUP BY, and many others.
At its core,
SELECT (thing) FROM (table); is a map (in the higher-order-function sense), or a list comprehension, of the rows from
(table). You can write an expression in terms of the column names; it will be evaluated for each row. Try these.
You can select more than one expression at once to produce a table with more than one column. In general, these expressions are called result expressions.
You can rename the columns of the resulting table with
AS. (If you don’t rename them, the column names will just be the literal string that’s the expression you wrote.)
* is just shorthand for “all the columns from the table”. If you want to get the whole table, you can just “map with the identity”:
Before we dig deeper into SELECT, it’s worth spending a moment on SQL expressions. Generally, SQL supports the basic arithmetic and comparison operators. Conditional expressions seem to be somewhat standardized as
The whitespace/indentation is not significant and is just for clarity. Omitting the
ELSE clause makes the default
NULL. Most SQL implementations have a bunch of other expressions and functions, but if you want to write any nontrivial expression, look it up for your implementation specifically; it varies widely. I don’t think even string concatenation or taking the min/max of two numbers has a consistent syntax.
Speaking of inconsistent expressions, let’s talk about NULL, the SQL scalar value that can appear almost anywhere because every data type is nullable, and how it works in expressions. As a reminder, its closest counterpart in other programming languages is likely a floating-point NaN rather than something actually called “null”. In SQL, NULL is falsy, but doing any arithmetic and comparison operations on it and anything else produces another NULL. In particular, NULL is not equal to itself! (More precisely, NULL = NULL is NULL again, which is falsy; but note that by the same token, it’s also not not-equal to itself.) Well, except when NULL is equal to NULL in other cases. Here’s SQLite’s page on NULL handling. To check if a value is NULL, use the special syntax
value IS NULL or
value IS NOT NULL.
In general, though, a simple result expression is basically implicitly a function
a.row -> s for some scalar type
s. A single
* is shorthand that expands into many result expressions, one per column. There is also the shorthand
a.*, which expands into one result expression per column from that particular table.
One SQL feature that also involves grouping/aggregation I won’t go into depth about is the “window function”. Using a window function doesn’t change the number of rows in the result, but rows can contain values that were computed by aggregating over multiple rows in the source table; the ranges of rows being aggregated over will thus typically repeat or overlap.
FROM, where we had a table, we can instead have several tables, each
JOINed to the previous in one of a few ways. If you
INNER JOIN two tables and do nothing else, you’re just taking the Cartesian product, much like a nested list comprehension.
INNER JOIN are equivalent.
Completely unnecessarily, a lot of tutorials treat “self-joins” (where you join a table with itself instead of a different table) specially. We’re not going to because there’s really no conceptual difference (although admittedly, there is the notational inconvenience that you have to alias one of the tables to be able to refer to it unambiguously) and because we save creating a table of sample data. A Cartesian product is a Cartesian product is a Cartesian product.
Of course, in the vast majority of use cases, you don’t just want the full Cartesian product of two tables or anything close to it. When you really, really do, implementations sometimes offer another term called CROSS JOIN. Sometimes this is exactly equivalent to JOIN and sometimes it’s not, but from what I can gather, it strongly connotes that you really want the full Cartesian product, rather than being a typical join where the number of returned rows is on the order of just the number of rows on one side or another.
Inner and Outer Joins
This is where most tutorials start giving technically incomplete explanations with Venn diagrams and whatever. Screw that.
You can add
ON followed by a predicate to any
JOIN. The predicate is an expression that filters the Cartesian product. That’s it.
LEFT JOIN aka
LEFT OUTER JOIN is much the same except that it forces every row from the left table to show up at least once. Concretely, if a row from the left table would be completely filtered out from the Cartesian product,
LEFT JOIN will add back a copy of that row combined with an imaginary all-NULL row from the right table, as seen in the bottom row here. (The blank in the second column is a NULL; I couldn’t figure out how to make SQLite show NULLs explicitly.)
RIGHT JOIN is just
LEFT JOIN but with the arguments flipped, so it forces every row from the right table to show up at least once.
FULL JOIN is the union of both: it forces all rows from both sides to show up at least once. SQLite doesn’t even support these two.)
You can join tables and joined products of tables. They follow the same rules.
Okay, although that’s (I believe) a complete formal description of what the result of a
JOIN is, it’s extremely unrepresentative of typical usages of
JOIN, both in terms of performance and intuitive meaning.
Most joins most people care about are across foreign keys. That’s a setup in which some table
a has a unique column like
id (every row has a distinct value of
id), and somewhere else you’ll have another table called
b with a column
a_id that “means” “the row in
a with that
id”. The SQL database might even be configured to enforce this, in which case it won’t allow you to insert rows in
a_id values that aren’t actually
ids of some row in
a. But no matter whether or how it’s enforced, the upshot is that every row in
b is associated with exactly one row of
a. In this case, we call
a_id a “foreign key”.
A typical join across a foreign key looks like
SELECT * FROM b JOIN a ON b.a_id = a.id;. Assuming the database is internally consistent, this
SELECT will return exactly one row for every row in
a_id is nullable. Then
SELECT * FROM b JOIN a ON b.a_id = a.id; will include one row for every row in
b that has an associated row in
a. Often, you don’t want that and would rather have exactly one row per
b, in which case you can do a LEFT JOIN as mentioned above:
SELECT * FROM b LEFT JOIN a ON b.a_id = a.id;. In either of these special (but very common) cases, combining the
JOIN and filtering with
ON has produced something very similar to a regular map or list comprehension across rows of
However, often you will also want to join across a foreign key “backwards”. The query
SELECT * FROM a JOIN b ON a.id = b.a_id; is effectively identical to the first query in the above paragraph, just with the columns swapped. However, replacing
LEFT JOIN in this query has a different effect; it will force every row from
a to appear at least once. And generally, sometimes we’ll want to do this because it will make more sense to view a more complicated query we’re writing as a flatmap over rows of
a instead of a map over rows of
b (e.g. if we join additional tables or add a
GROUP BY clause — see below).
WHERE followed by a predicate is a clause that filters the rows we’ve SELECTed thus far. It’s almost exactly like
ON, except that
WHERE can be used on non-JOINs, so you’ll use it much more often. Can you spot why they differ? It’s because the “add back copies of filtered-out rows” operation performed by
LEFT JOIN and friends comes after
ON but before
WHERE, so if you replace an
ON clause with a
WHERE clause you deprive
LEFT JOIN of its chance to add rows back. In the following example that was very similar to our first
ON example, no row has
a.id = 4.
This is where things get weird.
This is obviously not a map:
a has three rows, but the query only returned one. It’s actually a
catamorphism fold (aka reduce)! You can also precompose and postcompose things onto the fold. Note the difference:
The first query computes (1 + 2 + 3 + 4) + 1; the second query computes (1 + 1) + (2 + 1) + (3 + 1) + (4 + 1).
SUM and functions like it that cause folds are called “aggregation functions”. It’s not that obvious what their “type” is. At first glance it looks like SUM takes a number and outputs a number, but you can’t just, for example, compose it with itself willy-nilly:
The basic idea:
- Every expression is either aggregate or nonaggregate.
- If you
SELECTan aggregate expression, you’ll get a single row, whereas if you
SELECTa nonaggregate expression, you’ll get one row for each row in the source table.
- Most expressions (e.g. constants or column names) are nonaggregate.
- Aggregation functions convert nonaggregate expressions to aggregate expressions; they can’t be called on aggregate expressions.
- Aggregate expressions are “stronger” than nonaggregate expressions when they interact in any other case. If any result expression or subexpression thereof in a
SELECTis an aggregation expression, the entire
SELECTbecomes an aggregation. If there are also nonaggregate result expressions, or if aggregation expressions and nonaggregate expressions are combined anywhere, the nonaggregate expression will just be evaluated at an arbitrary row.
Consider this query:
It returns a single row whose first element is the sum of all
a, but whose second element is the
id from an arbitrary row. Other operators may even work the same way:
will return a single row that consists of the sum of all
a plus the
id from an arbitrary row.
There aren’t that many aggregation functions you can count on existing. I think the intersection of SQLite, MySQL, and Postgres is
COUNT counts the number of times its argument is truthy, but one of the most commonly useful special cases
COUNT(*) just counts the number of rows. The others are hopefully self-explanatory, with one caveat: Other than
COUNT, the aggregation functions all ignore
NULLs, and return
NULL when folded over zero non-
NULL arguments. This makes sense for
AVG, but not necessarily for
SUM (which you might expect to return 0), so be aware. (This is what the SQL standard requires, so it is what it is. SQLite actually has a
TOTAL function that’s exactly like
SUM except that it returns 0 when folded over zero non-NULL arguments, but this is again SQLite-specific.)
The arbitrary-row thing might not seem generally useful at first, but you don’t have to fold over the entire result set. You can instead
GROUP BY some expressions. In such cases, the rows are partitioned into groups on which the expressions being GROUPed BY (as a tuple expression) have the same values. The folds occur over those partitions, and the result contains one row per group.
Below, I group the rows into
(1, 3) and
(2, 4) and take the sum of squares across them.
As before, any non-aggregation result expressions will be evaluated at an arbitrary row in the group. This is what the below query gives me, but I shouldn’t depend on the
id of either row being 4 and 3 instead of 2 and 1.
However, note that selecting nonaggregate expressions that are equal to expressions that are being grouped by, or pure functions thereof, do not introduce arbitrariness, and thus are particularly useful. This query is predictable and potentially useful:
GROUP BY clause can be followed by a
HAVING clause, which is just a filter on the groups rather than the individual rows. I don’t have enough rows in our table to give a great example, but:
Sorting and Slicing
Next, we have clauses to sort and slice the resulting list of rows:
OFFSET. The basic syntax is pretty self-explanatory.
sqlite> SELECT *, a.id + b.id FROM a JOIN a AS b ORDER BY a.id + b.id; id id a.id + b.id ---------- ---------- ----------- 1 1 2 1 2 3 2 1 3 1 3 4 2 2 4 3 1 4 1 4 5 2 3 5 3 2 5 4 1 5 2 4 6 3 3 6 4 2 6 3 4 7 4 3 7 4 4 8 sqlite> SELECT *, a.id + b.id FROM a JOIN a AS b ORDER BY a.id + b.id LIMIT 3; id id a.id + b.id ---------- ---------- ----------- 1 1 2 1 2 3 2 1 3 sqlite> SELECT *, a.id + b.id FROM a JOIN a AS b ORDER BY a.id + b.id LIMIT 3 OFFSET 3; id id a.id + b.id ---------- ---------- ----------- 1 3 4 2 2 4 3 1 4
You can order by multiple expressions. After each expression, you can write ASC or DESC to say if you’re sorting in ascending or descending order.
SQL implementations differ on how NULL is sorted: it could be smaller or greater than everything else. Some implementations let you write
NULLS FIRST or
NULLS LAST after
DESC to specify this, but it’s not universal. You can always work around this with explicitly testing for
NULL and using the
CASE expression mentioned much earlier.
SELECT query is itself an expression. You can use them in many places inside an outer
SELECT query, just by wrapping it in parentheses; they’re called subqueries.
You can even use variables from the outside query! That’s called a correlated subquery. It’s probably more expensive because it has to run a subquery for every value taken by those outside variables, but the database may be able to figure out how to do it more efficiently.
Unfortunately, subqueries can only return a table with a single row and a single column.