Writing a simple SQL interpreter in Julia¶

So I've felt for a while that databases and SQL were somewhat of a weak spot in my CS knowledge. I've never taken a databases class, nor really read much about them. Though a few years' on-the-job exposure did help some (I could, like, put together a query to get some data), it just didn't feel intuitive. I used to say things like, "but what does this join really meannn??? Like, what's it actually doing!?"

So I took the Khan Academy "Intro to SQL" course over the weekend, and it was super helpful! I'd highly recommend it to anyone who felt like me. Cause it turns out, hey, a join is really simple! You can think of it as basically computing the Cartesian Product of your two tables, and then filtering out rows based on how you specified the query. No big deal.

So then, of course, armed with my new knowledge, I couldn't resist trying to implement it all myself. And man, things like this are just so fun in julia!

So here's the result! I thought it would be fun to paste this up on the web just because it's a cool example of building a domain-specific-language within julia through its super expressive metaprogramming.

The code is all here, in this github repo, in a package called SimpleSQL: https://github.com/NHDaly/SimpleSQL.jl
It all fits in just this one file: src/SimpleSQL.jl

So before I give any more details... Look, it's just like SQL!:

using SimpleSQL

@CREATE @TABLE groceries (:id, Int), (:name, String), (:quantity, Int), (:aisle, Int)
@INSERT @INTO groceries @VALUES (1, "Bananas", 34, 7);
@INSERT @INTO groceries @VALUES (2, "Peanut Butter", 1, 2);
@INSERT @INTO groceries @VALUES (3, "Dark Chocolate Bars", 2, 2);
@INSERT @INTO groceries @VALUES (4, "Ice cream", 1, 12);
@INSERT @INTO groceries @VALUES (5, "Cherries", 6, 2);
@INSERT @INTO groceries @VALUES (6, "Chocolate syrup", 1, 4);
@SELECT (*) @FROM groceries

id |                name | quantity | aisle
---|---------------------|----------|------
 1 |             Bananas |       34 |     7
 2 |       Peanut Butter |        1 |     2
 3 | Dark Chocolate Bars |        2 |     2
 4 |           Ice cream |        1 |    12
 5 |            Cherries |        6 |     2
 6 |     Chocolate syrup |        1 |     4

@SELECT name, aisle, quantity @FROM groceries @WHERE quantity .> 1

               name | aisle | quantity
--------------------|-------|---------
            Bananas |     7 |       34
Dark Chocolate Bars |     2 |        2
           Cherries |     2 |        6

# How many unique things and total things do I have in each of the first 5 aisles in the store?
@SELECT aisle, length(aisle), sum(quantity) @FROM groceries @WHERE aisle .<= 5 @GROUP @BY aisle

aisle | length(aisle) | sum(quantity)
------|---------------|--------------
    2 |             3 |             9
    4 |             1 |             1

Cool, right??

So yeah! In order to try to get a better sense of what SQL is doing, I built a small SQL emulator within julia. You can build up tables in memory, and then load and save them to disk. I only implemented a few of the basic commands, CREATE TABLE, INSERT, SELECT FROM, GROUP BY, and WHERE, but it was still a nice way to get a better sense for how these commands work.

My implementation basically boils down to keeping a tuple of column-vectors for each table, and all the query operations are just manipulating those vectors.

SELECT just gets the matching column by header-name from the table. If the select query is a julia expression instead of a name, though, that expression is evaluated with the column name replaced with the column value! More on that later...

WHERE creates a filtered copy of the vector, which is used for the rest of the query.

GROUP BY goes through a vector, and creates a new vector of indices (e.g. ["a", "b", "a"] => [1,2,1]), which is then used to split every other column vector into a bunch of row-vectors:

[5, 10, 2] => [[5 2],  # group 1
               [10]]    # group 2

Then finally the query expressions are evaluated over each of those row-columns. So if the above steps came from this query: @SELECT sum(a) @FROM t @GROUP @BY b, then the output column will end up as the first element in each resulting row, creating [7,10]. Basically, like this:

[
    sum([5 2])[1]   # 7
    sum([10])[1]    # 10
   ]

Definitely the coolest thing about this package is evaling the query expressions as native julia, because you can get access to all the powerful julia functions for free!

Here's another example, which showcases that a bit more:

solar_system_objects = read_table_from_disk("solar_system_objects.table")

solar_system_objects   32 rows
body::String
mean_radius::Real
mean_radius_rel::Real
volume::Real
volume_rel::Real
mass::Real
mass_rel::Real
density::Real
surface_gravity::Real
surface_gravity_rel::Real
type_of_object::String
shape::String

using Statistics  # For `mean` and `median`
@SQL begin
    @SELECT mean(density), median(density), round(sum(mass)), type_of_object
    @FROM solar_system_objects
    @WHERE (@. !isnan(mass))  # some elements are missing mass
    @GROUP @BY type_of_object
end

mean(density) | median(density) | round(sum(mass)) |                            type_of_object
--------------|-----------------|------------------|------------------------------------------
        1.408 |           1.408 |       1988550000 |                                      star
       1.0065 |          1.0065 |          2467060 |              planet (gas giant) has rings
        1.454 |           1.454 |           189262 |              planet (ice giant) has rings
     5.029375 |           5.335 |          11814.0 |                      planet (terrestrial)
   2.57784998 |          2.4745 |            393.0 |                      satellite of Jupiter
      1.33316 |           1.236 |            141.0 |                       satellite of Saturn
       3.3464 |          3.3464 |             74.0 |                        satellite of Earth
        2.061 |           2.061 |             22.0 |                      satellite of Neptune
         2.03 |            2.03 |             13.0 |             dwarf planet plutino multiple
         2.52 |            2.52 |             17.0 |                  dwarf planet @SDO binary
      1.59775 |           1.645 |              9.0 |                       satellite of Uranus
         2.55 |            2.55 |              4.0 | dwarf planet resoNaNt @KBO (7:12) trinary
         1.65 |            1.65 |              2.0 |                        satellite of Pluto
          2.2 |             2.2 |              1.0 |                           cubewano binary
        2.077 |           2.077 |              1.0 |                dwarf planet belt asteroid

Getting those complex expressions to work turned out to be a fun exploration of a few of the different types of metaprogramming in julia.

Andy Ferris gave a really nice explanation in his workshop at JuliaCon ("A practical introduction to metaprogramming in Julia") that julia actually does metaprogramming in lots of different ways.

There's macros, sure, which is what we typically think of as metaprogramming. And this package uses macros to let you to write queries in an SQL-like structure. (The macros are all the words starting with @.) But metaprogramming itself is more broad: it just means any time you treat your program itself as data. This concept comes from lisp, and it's central to how julia works: at runtime, julia is parsing every bit of the code you provide it into julia data structures: Exprs, Symbols and literals (ints, strings, etc), and then evaluating those structures.

We can do the same thing ourselves, by creating those Expressions manually, and letting julia eval them! For example:

x = Expr(:call, :+, 2, 3)

:(2 + 3)

eval(x)

5

e = quote
    a = [1,2]; sum(a)
end

quote
    #= In[9]:2 =#
    a = [1, 2]
    #= In[9]:2 =#
    sum(a)
end

eval(e)

3

I used this functionality in this package to automatically implement every imaginable function you could call over your data!

That is, by constructing a simple Expr from your query and letting julia eval it, we get the entire julia language as part of our SQL language implementation, for free! That is so cool! So I didn't need to write a custom function for the sum of a column or the length of a column, because it's just normal julia code.

Which is really amazing, because not only is it easy to implement, it also allows you to be way more expressive with your data than you could ever be in SQL.

Want to know the value of sin for each aisle? cos? How about their comparisons? Why not! It's all just julia.

@SELECT sin.(aisle), cos.(aisle), sin.(aisle) .> cos.(aisle) @FROM groceries

        sin.(aisle) |         cos.(aisle) | sin.(aisle) .> cos.(aisle)
--------------------|---------------------|---------------------------
 0.6569865987187891 |  0.7539022543433046 |                      false
 0.9092974268256817 | -0.4161468365471424 |                       true
 0.9092974268256817 | -0.4161468365471424 |                       true
-0.5365729180004349 |  0.8438539587324921 |                      false
 0.9092974268256817 | -0.4161468365471424 |                       true
-0.7568024953079282 | -0.6536436208636119 |                      false

Or all the items with chocolate?

@SELECT name, id @FROM groceries @WHERE occursin.("Chocolate", name)

               name | id
--------------------|---
Dark Chocolate Bars |  3
    Chocolate syrup |  6

And since these are just normal julia expressions, you can interpolate values into them too:

prices = [2.50, 4.25]  # This is the kind of thing you might normally use a JOIN for.
@SELECT name, quantity, $prices .* quantity @FROM groceries @WHERE occursin.("Chocolate", name)

               name | quantity | [2.5, 4.25] .* quantity
--------------------|----------|------------------------
Dark Chocolate Bars |        2 |                     5.0
    Chocolate syrup |        1 |                    4.25

The way this works, is that @SELECT macro passes those query parameters to the select_from function as Exprs (here).

Then, before it actually evaluates each expression (:(sin.(aisle)) for example), it does two things:

It digs into the expression to find the column name as a Symbol (:aisle).
Then it goes and gets that column from the table (via another select_from call) and replaces the column name with its value: :(sin.([7; 2; 2; 12; 2; 4])).

Now we can let julia evaluate that, and return it as a column. Isn't that cool?

Imagine trying to do something like this in C++... 🤢

As a result, we even get control flow within our queries, for free, even if it is ugly...

@SELECT if (sum(quantity) > 10); ["big..."]; else sum(quantity) end, aisle @FROM groceries @GROUP @BY aisle

if sum(quantity) > 10
    #= In[14]:1 =#
    ["big..."]
else
    #= In[14]:1 =#
    sum(quantity)
end | aisle
------------------------------------------------------------------------------------------------------|------
                                                                                               big... |     7
                                                                                                    1 |     2
                                                                                                    1 |    12
                                                                                                    1 |     4

This was just a fun way to spend a few afternoons, mostly just to try to solidify my understanding of what SQL is doing under the hood.

It's still definitely lacking in a bunch of ways, but it's neat nonetheless! One thing I was considering is whether I should be adding @. to all the expressions by default so you don't have to keep broadcasting everything above, but I left it out to be more explicit.

Sadly, I didn't implement JOIN. I think it wouldn't be very difficult, but I got tired of refactoring my select_from function every time I wanted to merge-in another bit of functionality. :]

In theory, though, a naive implementation would be something like creating new column-vectors for each column that comes out of your select statments, which are the Cartesian Product (×) of those columns from each table, and then doing your group by, on and where over those.

There's probably lots of SQL stuff I didn't implement, but here are some I thought about doing:

JOINs
AUTOINCREMENT PRIMARY KEY
AS
lowercase letterred macros.. lol
specifying only some of the values for INSERT.

BUT, fret not! The julia community is full of so many smart people, that of course other people have already created several awesome packages which do the same sorts of things!

Allow me to take a moment to introduce... Query.jl!

David Anthoff has really done same amazing work for the data scene in julia, and this is just one awesome example.

Query.jl let's you query over just about any type of data with this familiar sort of syntax.

julia> using Query, DataFrames

julia> df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])

julia> x = @from i in df begin
           @where i.age>50
           @select {i.name, i.children}
           @collect DataFrame
       end
>> 1×2 DataFrames.DataFrame
│ Row │ name   │ children │
├─────┼────────┼──────────┤
│ 1   │ "Kirk" │ 2        │

Looks familiar, right? It's really neat. In fact, more broadly, this idea of adding an SQL-inspired syntax for querying any datasource (not just an SQL table) comes from Microsoft's LINQ, which added this feature to C# more than 10 years ago.

The neat thing about julia's metaprogramming capabilities is that this same concept can be added to Julia via user-space packages, rather than needing to be a core part of the language.

Another approach is in MySQL.jl which allows you to simply communicate with a running MySQL server directly:

using MySQL
using DataFrames

conn = MySQL.connect("localhost", "root", "password", db = "test_db")

foo = MySQL.query(conn, """SELECT COUNT(*) FROM my_first_table;""", DataFrame)
num_foo = foo[1,1]

There are a bunch of other data querying packages in julia, as well, all exploring this concept and taking it in different directions. One example: Query.jl, mentioned above, now lets you mix these query operation concepts into normal code as standalone functions, e.g. peraisle = @groupby(df, _.aisle).

Finally, also, it's worth noting that I haven't really thought about performance at all. From what I've read, I think most SQL implementations work by first creating a query plan which takes into account the entire query, and then execute it.

In contrast, what I've done here sort of does each step as it reads it, which can result in extra work being done than is needed.

I think it's probably more appropriate to call this an SQL intepreter, rather than a proper implementation, since it's evaluating these query statements directly, rather than compiling them into a more appropriate representation (query plan) and running that in the traditional way. But I dunno, those are blurry words in this context, I think.

Hopefully you enjoyed this post! I enjoyed learning more about SQL and practicing metaprogramming in the process! It was lots of fun, and actually, relatively easy!