SQL Selects the Hard FP Way

This post is motivated by reasons very similar to the ones that motivated my React and Redux “tutorial”. Again, it should be more accurately but less informatively titled “How I wish SQL SELECTs were explained to me”. Again, it does not imply that this method of explanation is suitable for anybody else. One difference is that this time, I mostly only wanted to learn about SQL SELECTs to the extent it would help me perform and optimize queries in Django’s ORM, but to prevent this post from languishing forever in my drafts folder, that material has been sectioned off into a possible future post, because I figured out what I wanted, ran out of steam, and am now trying to learn TLA⁺. Just me things.

Background

The SQL standard is confusing and almost never completely implemented; there are huge inconsistencies between SQL implementations. I will focus on SQLite because it’s popular and easy to play with, but generally try to stay away from unpopular or nonstandard features. SQLite’s SELECT documentation is good reading for one particular SQL implementation.

A SQL database is a place where you store and query a bunch of data that’s organized into tables. A table is a homogeneous list of rows. A row is a heterogeneous tuple of values of various simple data types. The data types supported depend on the SQL implementation; typical examples are integers and strings of various sizes, floating point numbers, and dates/datetimes. All of these types can be nullable; NULL is a SQL value that can appear just about anywhere. (Like many of the other SQL features, NULL is handled somewhat inconsistently across SQL implementations, but as a first-order approximation it’s closer to a floating-point NaN than, say, Java’s “null”. We’ll talk more about it later.) However, note that you can’t have a variable-size list of other things in a row. And just to make sure it’s clear, all the rows in a given table must have the same data types in the same order.

A “column” is just what you’d intuitively expect it to be: it’s the homogeneous list of all values in a particular position in each row of a table, which all have the same data type. One thing I haven’t mentioned yet is that table columns all have names. This is true both for tables stored in the database and for the ephemeral tables that are the output of queries.

Since I’ll also be referring to more complex types like lists and tuples often seen in conventional programming languages, I’ll call these simple data types “scalar types” and values of those types “scalars”. This is not SQL terminology; documentation usually just calls these “data types”. Here’s SQLite’s page on data types.

To play along, install SQLite and run it. You should get dropped into a connection to an ephemeral in-memory database, which is plenty enough for our purposes. Make a table and mutter some magic incantations to make the output a little prettier for us:

(The first two commands are SQL and probably portable. The last two “dot-commands” are extremely SQLite-specific.)

Note: SQL is not case-sensitive. In this post, I’ll type all the keywords in uppercase for clarity and consistency with more documentation, but if I were you and just experimenting, I would type create table a (id int); and so on.

Overview

SELECT statements are the main way one gets information out of a database. For the purposes of this post, a SELECT statement looks like this:

Only SELECT and FROM are required; everything else is optional. There are several features I have omitted and am not talking about, to keep things simple and because I have never used them. There are operators you can use to combine multiple result tables that have the exact same column types: UNION, UNION ALL, INTERSECT, EXCEPT. There are also window functions, which are sort of spiritually related to GROUP BY, and many others.

SELECT

At its core, SELECT (thing) FROM (table); is a map (in the higher-order-function sense), or a list comprehension, of the rows from (table). You can write an expression in terms of the column names; it will be evaluated for each row. Try these.

You can select more than one expression at once to produce a table with more than one column. In general, these expressions are called result expressions.

You can rename the columns of the resulting table with AS. (If you don’t rename them, the column names will just be the literal string that’s the expression you wrote.)

* is just shorthand for “all the columns from the table”. If you want to get the whole table, you can just “map with the identity”:

Before we dig deeper into SELECT, it’s worth spending a moment on SQL expressions. Generally, SQL supports the basic arithmetic and comparison operators. Conditional expressions seem to be somewhat standardized as CASE/WHEN/THEN/ELSE/END:

The whitespace/indentation is not significant and is just for clarity. Omitting the ELSE clause makes the default NULL. Most SQL implementations have a bunch of other expressions and functions, but if you want to write any nontrivial expression, look it up for your implementation specifically; it varies widely. I don’t think even string concatenation or taking the min/max of two numbers has a consistent syntax.

Speaking of inconsistent expressions, let’s talk about NULL, the SQL scalar value that can appear almost anywhere because every data type is nullable, and how it works in expressions. As a reminder, its closest counterpart in other programming languages is likely a floating-point NaN rather than something actually called “null”. In SQL, NULL is falsy, but doing any arithmetic and comparison operations on it and anything else produces another NULL. In particular, NULL is not equal to itself! (More precisely, NULL = NULL is NULL again, which is falsy; but note that by the same token, it’s also not not-equal to itself.) Well, except when NULL is equal to NULL in other cases. Here’s SQLite’s page on NULL handling. To check if a value is NULL, use the special syntax value IS NULL or value IS NOT NULL.

In general, though, a simple result expression is basically implicitly a function a.row -> s for some scalar type s. A single * is shorthand that expands into many result expressions, one per column. There is also the shorthand a.*, which expands into one result expression per column from that particular table.

One SQL feature that also involves grouping/aggregation I won’t go into depth about is the “window function”. Using a window function doesn’t change the number of rows in the result, but rows can contain values that were computed by aggregating over multiple rows in the source table; the ranges of rows being aggregated over will thus typically repeat or overlap.

Joins

After FROM, where we had a table, we can instead have several tables, each JOINed to the previous in one of a few ways. If you INNER JOIN two tables and do nothing else, you’re just taking the Cartesian product, much like a nested list comprehension. JOIN and INNER JOIN are equivalent.

Completely unnecessarily, a lot of tutorials treat “self-joins” (where you join a table with itself instead of a different table) specially. We’re not going to because there’s really no conceptual difference (although admittedly, there is the notational inconvenience that you have to alias one of the tables to be able to refer to it unambiguously) and because we save creating a table of sample data. A Cartesian product is a Cartesian product is a Cartesian product.

Of course, in the vast majority of use cases, you don’t just want the full Cartesian product of two tables or anything close to it. When you really, really do, implementations sometimes offer another term called CROSS JOIN. Sometimes this is exactly equivalent to JOIN and sometimes it’s not, but from what I can gather, it strongly connotes that you really want the full Cartesian product, rather than being a typical join where the number of returned rows is on the order of just the number of rows on one side or another.

Inner and Outer Joins

This is where most tutorials start giving technically incomplete explanations with Venn diagrams and whatever. Screw that.

You can add ON followed by a predicate to any JOIN. The predicate is an expression that filters the Cartesian product. That’s it.

LEFT JOIN aka LEFT OUTER JOIN is much the same except that it forces every row from the left table to show up at least once. Concretely, if a row from the left table would be completely filtered out from the Cartesian product, LEFT JOIN will add back a copy of that row combined with an imaginary all-NULL row from the right table, as seen in the bottom row here. (The blank in the second column is a NULL; I couldn’t figure out how to make SQLite show NULLs explicitly.)

(RIGHT JOIN is just LEFT JOIN but with the arguments flipped, so it forces every row from the right table to show up at least once. FULL JOIN is the union of both: it forces all rows from both sides to show up at least once. SQLite doesn’t even support these two.)

You can join tables and joined products of tables. They follow the same rules.

Joins, Typically

Okay, although that’s (I believe) a complete formal description of what the result of a JOIN is, it’s extremely unrepresentative of typical usages of JOIN, both in terms of performance and intuitive meaning.

Most joins most people care about are across foreign keys. That’s a setup in which some table a has a unique column like id (every row has a distinct value of id), and somewhere else you’ll have another table called b with a column a_id that “means” “the row in a with that id”. The SQL database might even be configured to enforce this, in which case it won’t allow you to insert rows in b with a_id values that aren’t actually ids of some row in a. But no matter whether or how it’s enforced, the upshot is that every row in b is associated with exactly one row of a. In this case, we call a_id a “foreign key”.

A typical join across a foreign key looks like SELECT * FROM b JOIN a ON b.a_id = a.id;. Assuming the database is internally consistent, this SELECT will return exactly one row for every row in b. Sometimes a_id is nullable. Then SELECT * FROM b JOIN a ON b.a_id = a.id; will include one row for every row in b that has an associated row in a. Often, you don’t want that and would rather have exactly one row per b, in which case you can do a LEFT JOIN as mentioned above: SELECT * FROM b LEFT JOIN a ON b.a_id = a.id;. In either of these special (but very common) cases, combining the JOIN and filtering with ON has produced something very similar to a regular map or list comprehension across rows of b.

However, often you will also want to join across a foreign key “backwards”. The query SELECT * FROM a JOIN b ON a.id = b.a_id; is effectively identical to the first query in the above paragraph, just with the columns swapped. However, replacing JOIN with LEFT JOIN in this query has a different effect; it will force every row from a to appear at least once. And generally, sometimes we’ll want to do this because it will make more sense to view a more complicated query we’re writing as a flatmap over rows of a instead of a map over rows of b (e.g. if we join additional tables or add a GROUP BY clause — see below).

WHERE

WHERE followed by a predicate is a clause that filters the rows we’ve SELECTed thus far. It’s almost exactly like ON, except that WHERE can be used on non-JOINs, so you’ll use it much more often. Can you spot why they differ? It’s because the “add back copies of filtered-out rows” operation performed by LEFT JOIN and friends comes after ON but before WHERE, so if you replace an ON clause with a WHERE clause you deprive LEFT JOIN of its chance to add rows back. In the following example that was very similar to our first ON example, no row has a.id = 4.

Aggregation

This is where things get weird.

This is obviously not a map: a has three rows, but the query only returned one. It’s actually a catamorphism fold (aka reduce)! You can also precompose and postcompose things onto the fold. Note the difference:

The first query computes (1 + 2 + 3 + 4) + 1; the second query computes (1 + 1) + (2 + 1) + (3 + 1) + (4 + 1).

SUM and functions like it that cause folds are called “aggregation functions”. It’s not that obvious what their “type” is. At first glance it looks like SUM takes a number and outputs a number, but you can’t just, for example, compose it with itself willy-nilly:

The basic idea:

  • Every expression is either aggregate or nonaggregate.
  • If you SELECT an aggregate expression, you’ll get a single row, whereas if you SELECT a nonaggregate expression, you’ll get one row for each row in the source table.
  • Most expressions (e.g. constants or column names) are nonaggregate.
  • Aggregation functions convert nonaggregate expressions to aggregate expressions; they can’t be called on aggregate expressions.
  • Aggregate expressions are “stronger” than nonaggregate expressions when they interact in any other case. If any result expression or subexpression thereof in a SELECT is an aggregation expression, the entire SELECT becomes an aggregation. If there are also nonaggregate result expressions, or if aggregation expressions and nonaggregate expressions are combined anywhere, the nonaggregate expression will just be evaluated at an arbitrary row.

Consider this query:

It returns a single row whose first element is the sum of all ids in a, but whose second element is the id from an arbitrary row. Other operators may even work the same way:

will return a single row that consists of the sum of all ids in a plus the id from an arbitrary row.

There aren’t that many aggregation functions you can count on existing. I think the intersection of SQLite, MySQL, and Postgres is COUNT, SUM, MIN, MAX, and AVG (average). COUNT counts the number of times its argument is truthy, but one of the most commonly useful special cases COUNT(*) just counts the number of rows. The others are hopefully self-explanatory, with one caveat: Other than COUNT, the aggregation functions all ignore NULLs, and return NULL when folded over zero non-NULL arguments. This makes sense for MIN, MAX, and AVG, but not necessarily for SUM (which you might expect to return 0), so be aware. (This is what the SQL standard requires, so it is what it is. SQLite actually has a TOTAL function that’s exactly like SUM except that it returns 0 when folded over zero non-NULL arguments, but this is again SQLite-specific.)

Grouping

The arbitrary-row thing might not seem generally useful at first, but you don’t have to fold over the entire result set. You can instead GROUP BY some expressions. In such cases, the rows are partitioned into groups on which the expressions being GROUPed BY (as a tuple expression) have the same values. The folds occur over those partitions, and the result contains one row per group.

Below, I group the rows into (1, 3) and (2, 4) and take the sum of squares across them.

As before, any non-aggregation result expressions will be evaluated at an arbitrary row in the group. This is what the below query gives me, but I shouldn’t depend on the id of either row being 4 and 3 instead of 2 and 1.

However, note that selecting nonaggregate expressions that are equal to expressions that are being grouped by, or pure functions thereof, do not introduce arbitrariness, and thus are particularly useful. This query is predictable and potentially useful:

A GROUP BY clause can be followed by a HAVING clause, which is just a filter on the groups rather than the individual rows. I don’t have enough rows in our table to give a great example, but:

Sorting and Slicing

Next, we have clauses to sort and slice the resulting list of rows: ORDER BY, LIMIT, and OFFSET. The basic syntax is pretty self-explanatory.

You can order by multiple expressions. After each expression, you can write ASC or DESC to say if you’re sorting in ascending or descending order.

SQL implementations differ on how NULL is sorted: it could be smaller or greater than everything else. Some implementations let you write NULLS FIRST or NULLS LAST after ASC or DESC to specify this, but it’s not universal. You can always work around this with explicitly testing for NULL and using the CASE expression mentioned much earlier.

Nested Queries

A SELECT query is itself an expression. You can use them in many places inside an outer SELECT query, just by wrapping it in parentheses; they’re called subqueries.

You can even use variables from the outside query! That’s called a correlated subquery. It’s probably more expensive because it has to run a subquery for every value taken by those outside variables, but the database may be able to figure out how to do it more efficiently.

Unfortunately, subqueries can only return a table with a single row and a single column.

(note: the commenting setup here is experimental and I may not check my comments often; if you want to tell me something instead of the world, email me!)