This post is motivated by reasons very similar to the ones that motivated my React and Redux “tutorial”. Again, it should be more accurately but less informatively titled “How I wish SQL SELECTs were explained to me”. Again, it does not imply that this method of explanation is suitable for anybody else. One difference is that this time, I mostly only wanted to learn about SQL SELECTs to the extent it would help me perform and optimize queries in Django’s ORM, but to prevent this post from languishing forever in my drafts folder, that material has been sectioned off into a possible future post, because I figured out what I wanted, ran out of steam, and am now trying to learn TLA⁺. Just me things.
Background
The SQL standard is confusing and almost never completely implemented; there are huge inconsistencies between SQL implementations. I will focus on SQLite because it’s popular and easy to play with, but generally try to stay away from unpopular or nonstandard features. SQLite’s SELECT documentation is good reading for one particular SQL implementation.
A SQL database is a place where you store and query a bunch of data that’s organized into tables. A table is a homogeneous list of rows. A row is a heterogeneous tuple of values of various simple data types. The data types supported depend on the SQL implementation; typical examples are integers and strings of various sizes, floating point numbers, and dates/datetimes. All of these types can be nullable; NULL is a SQL value that can appear just about anywhere. (Like many of the other SQL features, NULL is handled somewhat inconsistently across SQL implementations, but as a first-order approximation it’s closer to a floating-point NaN than, say, Java’s “null”. We’ll talk more about it later.) However, note that you can’t have a variable-size list of other things in a row. And just to make sure it’s clear, all the rows in a given table must have the same data types in the same order.
A “column” is just what you’d intuitively expect it to be: it’s the homogeneous list of all values in a particular position in each row of a table, which all have the same data type. One thing I haven’t mentioned yet is that table columns all have names. This is true both for tables stored in the database and for the ephemeral tables that are the output of queries.
Since I’ll also be referring to more complex types like lists and tuples often seen in conventional programming languages, I’ll call these simple data types “scalar types” and values of those types “scalars”. This is not SQL terminology; documentation usually just calls these “data types”. Here’s SQLite’s page on data types.
To play along, install SQLite and run it. You should get dropped into a connection to an ephemeral in-memory database, which is plenty enough for our purposes. Make a table and mutter some magic incantations to make the output a little prettier for us:
CREATE TABLE a (id int);
INSERT INTO a VALUES (1), (2), (3), (4);
on
.headers mode column .
(The first two commands are SQL and probably portable. The last two “dot-commands” are extremely SQLite-specific.)
Note: SQL is not case-sensitive. In this post, I’ll type all the
keywords in uppercase for clarity and consistency with more
documentation, but if I were you and just experimenting, I would type
create table a (id int);
and so on.
Overview
SELECT
statements are the main way one gets information
out of a database. For the purposes of this post, a SELECT
statement looks like this:
SELECT (optionally, DISTINCT) (1+ result columns)
FROM (1+ tables or subqueries)
some kind of) JOIN 0+ other tables ON some condition
(WHERE some condition
GROUP BY some expressions
HAVING some condition
ORDER BY (1+ expressions)
LIMIT expression OFFSET expression
Only SELECT
and FROM
are required;
everything else is optional. There are several features I have omitted
and am not talking about, to keep things simple and because I have never
used them. There are operators you can use to combine multiple result
tables that have the exact same column types: UNION
,
UNION ALL
, INTERSECT
, EXCEPT
.
There are also window functions, which are sort of spiritually related
to GROUP BY
, and many others.
SELECT
At its core, SELECT (thing) FROM (table);
is a map (in
the higher-order-function sense), or a list comprehension, of the rows
from (table)
. You can write an expression in terms of the
column names; it will be evaluated for each row. Try these.
SELECT 1 FROM a;
SELECT id FROM a;
SELECT id * id FROM a;
You can select more than one expression at once to produce a table with more than one column. In general, these expressions are called result expressions.
SELECT 1, id, id * id FROM a;
You can rename the columns of the resulting table with
AS
. (If you don’t rename them, the column names will just
be the literal string that’s the expression you wrote.)
SELECT id * id AS id_squared FROM a;
*
is just shorthand for “all the columns from the
table”. If you want to get the whole table, you can just “map with the
identity”:
SELECT * FROM a;
Before we dig deeper into SELECT, it’s worth spending a moment on SQL
expressions. Generally, SQL supports the basic arithmetic and comparison
operators. Conditional expressions seem to be somewhat standardized as
CASE
/WHEN
/THEN
/ELSE
/END
:
SELECT CASE
WHEN id = 1 THEN 11111
WHEN id = 2 THEN 22222
WHEN id = 3 THEN 33333
ELSE 44444
END FROM a;
The whitespace/indentation is not significant and is just for
clarity. Omitting the ELSE
clause makes the default
NULL
. Most SQL implementations have a bunch of other
expressions and functions, but if you want to write any nontrivial
expression, look it up for your implementation specifically; it varies
widely. I don’t think even string concatenation or taking the min/max of
two numbers has a consistent syntax.
Speaking of inconsistent expressions, let’s talk about NULL, the SQL
scalar value that can appear almost anywhere because every data type is
nullable, and how it works in expressions. As a reminder, its closest
counterpart in other programming languages is likely a floating-point
NaN rather than something actually called “null”. In SQL, NULL is falsy,
but doing any arithmetic and comparison operations on it and anything
else produces another NULL. In particular, NULL is not equal to itself!
(More precisely, NULL = NULL is NULL again, which is falsy; but note
that by the same token, it’s also not not-equal to itself.) Well, except
when NULL is equal to NULL in other cases. Here’s SQLite’s page on NULL handling.
To check if a value is NULL, use the special syntax
value IS NULL
or value IS NOT NULL
.
In general, though, a simple result expression is basically
implicitly a function a.row -> s
for some scalar type
s
. A single *
is shorthand that expands into
many result expressions, one per column. There is also the shorthand
a.*
, which expands into one result expression per column
from that particular table.
One SQL feature that also involves grouping/aggregation I won’t go into depth about is the “window function”. Using a window function doesn’t change the number of rows in the result, but rows can contain values that were computed by aggregating over multiple rows in the source table; the ranges of rows being aggregated over will thus typically repeat or overlap.
Joins
After FROM
, where we had a table, we can instead have
several tables, each JOIN
ed to the previous in one of a few
ways. If you INNER JOIN
two tables and do nothing else,
you’re just taking the Cartesian product, much like a nested list
comprehension. JOIN
and INNER JOIN
are
equivalent.
> SELECT *, a.id + b.id FROM a JOIN a AS b;
sqliteid id a.id + b.id
---------- ---------- -----------
1 1 2
1 2 3
1 3 4
1 4 5
2 1 3
2 2 4
2 3 5
2 4 6
3 1 4
3 2 5
3 3 6
3 4 7
4 1 5
4 2 6
4 3 7
4 4 8
Completely unnecessarily, a lot of tutorials treat “self-joins” (where you join a table with itself instead of a different table) specially. We’re not going to because there’s really no conceptual difference (although admittedly, there is the notational inconvenience that you have to alias one of the tables to be able to refer to it unambiguously) and because we save creating a table of sample data. A Cartesian product is a Cartesian product is a Cartesian product.
Of course, in the vast majority of use cases, you don’t just want the full Cartesian product of two tables or anything close to it. When you really, really do, implementations sometimes offer another term called CROSS JOIN. Sometimes this is exactly equivalent to JOIN and sometimes it’s not, but from what I can gather, it strongly connotes that you really want the full Cartesian product, rather than being a typical join where the number of returned rows is on the order of just the number of rows on one side or another.
Inner and Outer Joins
This is where most tutorials start giving technically incomplete explanations with Venn diagrams and whatever. Screw that.
You can add ON
followed by a predicate to any
JOIN
. The predicate is an expression that filters the
Cartesian product. That’s it.
> SELECT *, a.id + b.id FROM a JOIN a AS b ON a.id < b.id;
sqliteid id a.id + b.id
---------- ---------- -----------
1 2 3
1 3 4
1 4 5
2 3 5
2 4 6
3 4 7
LEFT JOIN
aka LEFT OUTER JOIN
is much the
same except that it forces every row from the left table to show up at
least once. Concretely, if a row from the left table would be completely
filtered out from the Cartesian product, LEFT JOIN
will add
back a copy of that row combined with an imaginary all-NULL row from the
right table, as seen in the bottom row here. (The blank in the second
column is a NULL; I couldn’t figure out how to make SQLite show NULLs
explicitly.)
> SELECT * FROM a LEFT JOIN a AS b ON a.id < b.id;
sqliteid id
---------- ----------
1 2
1 3
1 4
2 3
2 4
3 4
4
(RIGHT JOIN
is just LEFT JOIN
but with the
arguments flipped, so it forces every row from the right table to show
up at least once. FULL JOIN
is the union of both: it forces
all rows from both sides to show up at least once. SQLite doesn’t even
support these two.)
You can join tables and joined products of tables. They follow the same rules.
Joins, Typically
Okay, although that’s (I believe) a complete formal description of
what the result of a JOIN
is, it’s extremely
unrepresentative of typical usages of JOIN
, both in terms
of performance and intuitive meaning.
Most joins most people care about are across foreign
keys. That’s a setup in which some table a
has a
unique column like id
(every row has a distinct
value of id
), and somewhere else you’ll have another table
called b
with a column a_id
that “means” “the
row in a
with that id
”. The SQL database might
even be configured to enforce this, in which case it won’t allow you to
insert rows in b
with a_id
values that aren’t
actually id
s of some row in a
. But no matter
whether or how it’s enforced, the upshot is that every row in
b
is associated with exactly one row of a
. In
this case, we call a_id
a “foreign key”.
A typical join across a foreign key looks like
SELECT * FROM b JOIN a ON b.a_id = a.id;
. Assuming the
database is internally consistent, this SELECT
will return
exactly one row for every row in b
. Sometimes
a_id
is nullable. Then
SELECT * FROM b JOIN a ON b.a_id = a.id;
will include one
row for every row in b
that has an associated row in
a
. Often, you don’t want that and would rather have exactly
one row per b
, in which case you can do a LEFT JOIN as
mentioned above:
SELECT * FROM b LEFT JOIN a ON b.a_id = a.id;
. In either of
these special (but very common) cases, combining the JOIN
and filtering with ON
has produced something very similar
to a regular map or list comprehension across rows of
b
.
However, often you will also want to join across a foreign key
“backwards”. The query
SELECT * FROM a JOIN b ON a.id = b.a_id;
is effectively
identical to the first query in the above paragraph, just with the
columns swapped. However, replacing JOIN
with
LEFT JOIN
in this query has a different effect; it will
force every row from a
to appear at least once. And
generally, sometimes we’ll want to do this because it will make more
sense to view a more complicated query we’re writing as a flatmap over
rows of a
instead of a map over rows of b
(e.g. if we join additional tables or add a GROUP BY
clause
— see below).
WHERE
WHERE
followed by a predicate is a clause that filters
the rows we’ve SELECTed thus far. It’s almost exactly like
ON
, except that WHERE
can be used on
non-JOINs, so you’ll use it much more often. Can you spot why they
differ? It’s because the “add back copies of filtered-out rows”
operation performed by LEFT JOIN
and friends comes after
ON
but before WHERE
, so if you replace an
ON
clause with a WHERE
clause you deprive
LEFT JOIN
of its chance to add rows back. In the following
example that was very similar to our first ON
example, no
row has a.id = 4
.
> SELECT * FROM a LEFT JOIN a AS b WHERE a.id < b.id;
sqliteid id
---------- ----------
1 2
1 3
1 4
2 3
2 4
3 4
Aggregation
This is where things get weird.
> SELECT SUM(id) FROM a;
sqliteSUM(id)
----------
10
This is obviously not a map: a
has three rows, but the
query only returned one. It’s actually a catamorphism fold
(aka reduce)! You can also precompose and postcompose things onto the
fold. Note the difference:
> SELECT SUM(id) + 1 FROM a;
sqliteSUM(id) + 1
-----------
11
> SELECT SUM(id + 1) FROM a;
sqliteSUM(id + 1)
-----------
14
The first query computes (1 + 2 + 3 + 4) + 1; the second query computes (1 + 1) + (2 + 1) + (3 + 1) + (4 + 1).
SUM
and functions like it that cause folds are called
“aggregation functions”. It’s not that obvious what their “type” is. At
first glance it looks like SUM takes a number and outputs a number, but
you can’t just, for example, compose it with itself willy-nilly:
> SELECT SUM(SUM(id)) FROM a;
sqliteof aggregate function SUM() Error: misuse
The basic idea:
- Every expression is either aggregate or nonaggregate.
- If you
SELECT
an aggregate expression, you’ll get a single row, whereas if youSELECT
a nonaggregate expression, you’ll get one row for each row in the source table. - Most expressions (e.g. constants or column names) are nonaggregate.
- Aggregation functions convert nonaggregate expressions to aggregate expressions; they can’t be called on aggregate expressions.
- Aggregate expressions are “stronger” than nonaggregate expressions
when they interact in any other case. If any result expression or
subexpression thereof in a
SELECT
is an aggregation expression, the entireSELECT
becomes an aggregation. If there are also nonaggregate result expressions, or if aggregation expressions and nonaggregate expressions are combined anywhere, the nonaggregate expression will just be evaluated at an arbitrary row.
Consider this query:
SELECT SUM(id), id FROM a;
It returns a single row whose first element is the sum of all
id
s in a
, but whose second element is the
id
from an arbitrary row. Other operators may even work the
same way:
SELECT SUM(id) + id FROM a;
will return a single row that consists of the sum of all
id
s in a
plus the id
from an
arbitrary row.
There aren’t that many aggregation functions you can count on
existing. I think the intersection of SQLite, MySQL, and Postgres is
COUNT
, SUM
, MIN
,
MAX
, and AVG
(average). COUNT
counts the number of times its argument is truthy, but one of the most
commonly useful special cases COUNT(*)
just counts the
number of rows. The others are hopefully self-explanatory, with one
caveat: Other than COUNT
, the aggregation
functions all ignore NULL
s, and return
NULL
when folded over zero non-NULL
arguments. This makes sense for MIN
,
MAX
, and AVG
, but not necessarily for
SUM
(which you might expect to return 0), so be aware.
(This is what the SQL standard requires, so it is what it is. SQLite
actually has a TOTAL
function that’s exactly like
SUM
except that it returns 0 when folded over zero non-NULL
arguments, but this is again SQLite-specific.)
Grouping
The arbitrary-row thing might not seem generally useful at first, but
you don’t have to fold over the entire result set. You can
instead GROUP BY
some expressions. In such cases, the rows
are partitioned into groups on which the expressions being GROUPed BY
(as a tuple expression) have the same values. The folds occur over those
partitions, and the result contains one row per group.
Below, I group the rows into (1, 3)
and
(2, 4)
and take the sum of squares across them.
> SELECT SUM(id * id) FROM a GROUP BY id % 2;
sqliteSUM(id * id)
------------
20
10
As before, any non-aggregation result expressions will be evaluated
at an arbitrary row in the group. This is what the below query gives me,
but I shouldn’t depend on the id
of either row being 4 and
3 instead of 2 and 1.
> SELECT id, SUM(id * id) FROM a GROUP BY id % 2;
sqliteid SUM(id * id)
---------- ------------
4 20
3 10
However, note that selecting nonaggregate expressions that are equal to expressions that are being grouped by, or pure functions thereof, do not introduce arbitrariness, and thus are particularly useful. This query is predictable and potentially useful:
> SELECT id % 2, SUM(id * id) FROM a GROUP BY id % 2;
sqliteid % 2 SUM(id * id)
---------- ------------
0 20
1 10
A GROUP BY
clause can be followed by a
HAVING
clause, which is just a filter on the groups rather
than the individual rows. I don’t have enough rows in our table to give
a great example, but:
> SELECT id % 2, SUM(id * id) FROM a GROUP BY id % 2 HAVING id % 2 != 1;
sqliteid % 2 SUM(id * id)
---------- ------------
0 20
Sorting and Slicing
Next, we have clauses to sort and slice the resulting list of rows:
ORDER BY
, LIMIT
, and OFFSET
. The
basic syntax is pretty self-explanatory.
> SELECT *, a.id + b.id FROM a JOIN a AS b ORDER BY a.id + b.id;
sqliteid id a.id + b.id
---------- ---------- -----------
1 1 2
1 2 3
2 1 3
1 3 4
2 2 4
3 1 4
1 4 5
2 3 5
3 2 5
4 1 5
2 4 6
3 3 6
4 2 6
3 4 7
4 3 7
4 4 8
> SELECT *, a.id + b.id FROM a JOIN a AS b ORDER BY a.id + b.id LIMIT 3;
sqliteid id a.id + b.id
---------- ---------- -----------
1 1 2
1 2 3
2 1 3
> SELECT *, a.id + b.id FROM a JOIN a AS b ORDER BY a.id + b.id LIMIT 3 OFFSET 3;
sqliteid id a.id + b.id
---------- ---------- -----------
1 3 4
2 2 4
3 1 4
You can order by multiple expressions. After each expression, you can write ASC or DESC to say if you’re sorting in ascending or descending order.
> SELECT *, a.id + b.id FROM a JOIN a AS b ORDER BY b.id DESC, a.id DESC;
sqliteid id a.id + b.id
---------- ---------- -----------
4 4 8
3 4 7
2 4 6
1 4 5
4 3 7
3 3 6
2 3 5
1 3 4
4 2 6
3 2 5
2 2 4
1 2 3
4 1 5
3 1 4
2 1 3
1 1 2
SQL implementations differ on how NULL is sorted: it could be smaller
or greater than everything else. Some implementations let you write
NULLS FIRST
or NULLS LAST
after
ASC
or DESC
to specify this, but it’s not
universal. You can always work around this with explicitly testing for
NULL
and using the CASE
expression mentioned
much earlier.
Nested Queries
A SELECT
query is itself an expression. You can use them
in many places inside an outer SELECT
query, just by
wrapping it in parentheses; they’re called subqueries.
> SELECT id + (SELECT SUM(id) FROM a) FROM a;
sqliteid + (SELECT SUM(id) FROM a)
----------------------------
11
12
13
14
You can even use variables from the outside query! That’s called a correlated subquery. It’s probably more expensive because it has to run a subquery for every value taken by those outside variables, but the database may be able to figure out how to do it more efficiently.
SELECT id + (SELECT SUM(b.id) FROM a AS b WHERE b.id > a.id) FROM a;
id + (select sum(b.id) from a as b where b.id > a.id)
-----------------------------------------------------
10
9
7
Unfortunately, subqueries can only return a table with a single row and a single column.