A dream ORM [WIP]

Wed Aug 23 2023E.W.Ayers

Persistent data is a strange topic.

I am creating a sqlite ORM, dxd. It's only half-built, so some of this is still just a wishlist / working doc.

1. Don't hide the SQL

Dxd comes with a full SQL parser and AST manipulation system. You can take SQL expressions and combine them programmatically. These will get transformed into SQL strings that you can inspect directly. At any point you can take a dxd object and see what SQL query it will send to the DB to get results.

Having the AST comes with some benefits that doing string-only SQL doesn't give you.

1.1. Proper placeholder management

A SQL expression in dxd comes with a sequence of 'placeholder' objects. So when you write db.execute('SELECT from t WHERE x = ?', (4,)) in raw SQL; in dxd this will be compiled to an AST like SelectClause(TableId('t'), where = Eq(Var('x'), Value(4))). So you don't have to do any more 'question-mark counting' in your queries. (Dxd also supports named placeholders). Another advantage of this is different dialects of SQL often use different syntax for placeholder substitutions, so this abstracts that.

1.2. Type checking of queries

SQL doesn't do any type checking of your queries before they run, it just throws errors if you do an illegal move. But sometimes I can end up accidentally comparing a string to an int. In dxd, these SQL AST objects also perform some basic type inference to try to detect silly errors before the query is run and provide more helpful errors. Also wherever possible dxd makes sure that all the objects have good Python types. If a column is an integer, your LSP-enabled editor will infer it has a type like Expr[int] (or maybe just int). It's not perfect, but it's helpful to know what data you are throwing around.

1.3. Scoping SQL variables

Additionally, it will do variable scoping checks on the query. For example if you do SELECT x from t WHERE y NOT IN (SELECT z from t2 WHERE x + z = y), it can be ambiguous which tables x, y and z are bound to, particularly scary if t and t2 share some column names. This can get really nasty if you are performing joins where the same table is used multiple times. In the dxd SQL AST expressions, these variables are (tbd) stored as fully-expanded identifiers with proper scoping checks.

1.4. Operator overloading

If you have two query expressions, you can just use the normal python operations +, *, == etc and it will construct the right SQL-equivalent query expression for you. It will also throw if you try to use the non-overloadable and and or operators.

1.5. Prefer lazy query building

Most ORMs just return the whole data object, I would much rather offer a lazily materialized query that you build up. So for example you might need to retrieve a row of students, but each row might have like 50 columns that you don't care about. I only care about the tutor_id field for my join. Most ORMs will just return a fully instantiated instance of a Student object, but instead it should just be something like Query[Student] that can be plugged into a join step. We just provide correctly typed helper functions like student.tutor_id : Query[UUID].

2. Data conversion

One of the main pains that ORMs solve is converting values to and from a format that the database can understand. This is also database-dependent, sqlite and postgres handle their datatypes very differently.

2.1. Adapters for atomic data

To solve this, dxd extends the 'adapter pattern' and I aim to support sensible-default conversions for all of the commonly used datatypes like UUID and datetime.

[todo] describe __adapt__ and __conform__ here

Why use adapt and conform? It's not standard (the PEP got rejected), but it is already used by the built-in sqlite library so users should be familiar with it. It's also sufficiently powerful that it can be overridden correctly.

2.2. Create composite data with 'patterns'

Sometimes you want the output to your query to not just be a tuple of data. I often find myself writing variations on this:

items = db.execute('SELECT (a,b,c) FROM t;').fetchall()
items = [{'a' : a, 'b' : b, 'c' : c} for a,b,c in items]

It's tedious. With dxd, any selector clause in a top-level SELECT statement can be replaced with a 'pattern'. A pattern can be:

Patterns are composable and play well with the placeholder-tracking system.

[todo] examples go here. (patterns are implemented)

3. Schemas

Managing schemas and migrations is always a pain.

3.1. Managing schemas in dxd

I'm a big fan of 'infrastructure as code'. Your persistence stores are a kind of infrastructure, so they should be declared in code. Dxd has a set of features for managing the state of your tables by expressing your schemas and tables declaratively as python objects. Dxd will do its best to make sure that these schemas are kept in sync with the actual tables, indexes, views, triggers in the database itself. The dream is that dxd will do simple migrations for you to keep everything in sync, but get out of the way when there is a more serious migration issue and ensure that older versions of the app will still be able to access the db without fighting over exactly what the schema should be.

3.2. Migration is hard

I have a spicy take on migrations, which is that they just irreducibly suck. Changes to a schema are conservative or non-conservative. A conservative change is one where all existing consumers of the data won't be able to tell the difference. This is things like: adding a new column (with a default), adding a new relation.

Conservative migrations are just completely fine and can be automated.

My controversial solution to non-conservative migrations is that you should just not do them (at least with an ORM). If you want to change your schema non-conservatively, you should delete all the tables and start again.

Isn't this really bad? Only in serious production-grade dbs where you need zero downtime. You should have all of your dev/test data in a file that can be loaded into a fresh db, so when the schema changes.

For smaller prod projects, you should have all of your prod data in a backup, ideally this backup is in a WAL-style format like a sequence of events. So, I recommend deleting the old version of the database, and then repopulating it from this backup with the new schema. For the majority of projects this is just fine, just have 20 mins of downtime.

So the only remaining case to worry about is a hot-swapping zero-downtime production system. In this case, it's just an irreducible engineering problem where you need to really think about the access patterns and what you can get away with during the transition. You need to have a multi-stage plan where you maintain a transient dual-schema table while you switch over consumers to use the new schema and then only when everything is switched you use the new schema. I think there might be space for some kind of migration tool can help in situations like this, but it's not going to be in scope for a little Python ORM. You are just going to have to muck in writing ALTER TABLE commands directly to the DB.

3.3. Schemas in dxd

With that in mind, lets look at the features dxd offers for defining data models.

You can create a schema for a table by just making a python class with

from dxd import Schema
class User(Schema):
email : Email
id : UUID = Field(default_factory = uuid4)
birthday : date
class Config:
primary_key = "id"
users : Table[User] = Table.create(User, 'users')
# user.birthday : Column[date]
users.insert(User(email="joe@example.com", birthday = date('2010-01-01')))
# all queries and operations use the same syntax as pandas.
# User.id == "foo : Q"
r : View[User] = users[User.birthday > date('2009-01-01')]

I haven't completely decided on the exact API. But the principles are:

The main thing that the Schema mixin gives you is the create_table('my_table') function that makes the persistent table. Note that this means you can have multiple tables with the same schema. If you skip the name it will give it a default name Users_table. The system will warn you if there is already a table with the given name but with an incompatible schema (ie one schema is not a conservative migration of the other). You can either specify details like primary-key, constraints on the Schema or in the create_table function.

Additionally, in the create_table function you need to specify which tables the foreign keys in this table are referring to. Dxd will do some checks to make sure that the foreign keys are valid.

You can also do schema-intersections A & B: all the fields from both A and B must be present.

3.4. Join fields

Often you want one column in a table to refer to an entry in another table. The standard way to do this is by storing foreign keys.

But sometimes when you are rapid-prototyping you don't really want to faff around and would rather have something closer to Python's data model. Dxd offers a few ways of expressing these kinds of relations that go beyond foreign keys (but under the hood they all resolve to foreign keys).

These column types abstract over the various ways that you might need to access data that would typically be done through joins in SQL.

The dream is you can also write custom composite join-field types and provide the machinery for creating and looking these up. When you create queries with these statements, they will be compiled to SQL queries (not run in python). The point is that although you have all these wacky abstractions, at the end of the day they all cash out as SQL that you can just look at. All of these features are optional and you can always fall back to using the SQL directly.

3.5. Standalone persistent objects

Sometimes you just want a dictionary that is persistent. Dxd comes with a set of persistent out-of-the-box datastructures that you can use. If you are familiar with the DiskCache library you will get the idea.

from dxd import pdict
x : pdict[str, ing] = pdict('my-pdict')
# this will get
x['hello'] = 5

This code will create a sqlite table 'named my-pdict' that will behave like a dictionary, but any changes will be instantly persisted to disk.

I think the exact location that the table is stored on disk should be configurable with a ContextVar, but the default is some sensible app-storage area.

There are a few other structures you can use:

Each persistent datastructure also comes with transaction primitives, so you can start a transaction to atomically make changes to the structure. Some atomic operations are built in, like pdict.modify(f : VV).

3.6. More complicated persistent objects

I also want to offer some more advanced persistence systems. Something I'm really excited about working on is having versioning and merging primitives that support conflict resolution.

For example, there could be a PersistentTextDoc type which is just a string or maybe a structured rich text tree. Multiple users could modify this locally, send back to server and then the persistence layer on the server could receive two conflicting versions. Both of these would be persisted and a conflict resolution merge would be provided.

How a merge should occur is usually an app-specific choice, so giving the users the power to create their own persistent objects and composing them out of simpler ones seems the right way to go here.

Any of these kinds of objects should also be allowed to live as a column in a schema.

4. Queries

Current plan is queries behave the same as in Pandas and similar dataframe manipulation libraries like Polars. Unlike pandas, the queries are lazy, at any point you can run q.to_sql and see the generated sql. They only get sent to db when you do something like iter them. There is a cool library called pandas-to-sql that this is related to. But also, all this stuff is optional and you can just use normal SQL queries too and mix and match. All of the query syntax I introduce here gets transparently compiled to SQL that you can read and modify.

table[(Blob.user_id == user.id) & (Blob.digest == digest)] # : View[Blob]

returns an iterable of Blob objects. The query will only be executed once iter is called on it. But you can also select particular columns.

filter = (Blob.user_id == user.id) & (Blob.digest == digest) # Predicate[Blob]
table[filter].content_length # : Series[int]

We also provide a map and select operator. The below examples return the same thing, but map will be run locally, whereas select will wrap it as a query. The object in the select statement is called a 'pattern'.

table[filter].select( (Blob.user_id, Blob.digest)) # : View[tuple[UUID, str]]
table[filter].map(lambda b: (b.user_id, b.digest)) # : View[tuple[UUID, str]] ← materialized locally

As a shorthand, you can also write table[filter][(Blob.user_id, Blob.digest)] similarly to pandas. But you should prefer select because then we can give correct type hinting.

loc, head, tail, nlargest,

groupby, agg are also available and work the same as in pandas.

table.groupby(Blob.user_id) # : Grouping[UUID, Blob]
table.groupby(Blob.user_id).content_length # : Grouping[UUID, int]
table.groupby(Blob.user_id).content_length.sum() # : int

Joins are done using the merge operator.

blobs : Table[Blob]
users : Table[User]
dxd.merge(blobs, users, how='left', on=(Blob.user_id, User.id))

Filtering joins:

# get orphaned blobs
blobs[~ blobs[Blob.user_id].isin(users[User.id])]

4.1. Crazy joins with product tables.

But I never liked this approach, so you can also do joins with 'multi table predicates'. These are predicates that map multiple table indices to bools. They can be used to create joins.

pp = Blob.user_id == User.id # MultiPredicate[Blob, User]
bxu = dxd.join(blobs, users) # : Join[Blob, User]
y = bxu[Blob.user_id == User.id] # : Join[Blob, User]
# which is shorthand for
y = bxu[bxu[0].user_id == bxu[1].id]

Iterating bxu would do the same as itertools.product(blobs, users). Iterating y returns pairs of blobs and users with matching user ids. But it would compile y to a query, so what actually runs is a join on the database.

y.select((y[0].digest, y[1].username))

You can also write dxd.product({"blob": blobs, "users": users}) to instead have a named product table. Then you would write y['blobs'] instead of y[0]. The idea of this approach is that you no longer need to clunkily distinguish between left, right, inner, outer joins. You just choose a predicate over the product table, and you iterate or select on the resulting set.

4.2. Mutating the data

You can mutate things the same way you would in pandas.

# modify all
table.attempts += 1
# modify only certain rows
table[filter].attempts += 1

We provide update and insert operators.

Blob.attempts : Blob.attempts + 1
Blob.last_accessed : datetime.utcnow()

Inserting is tricky because it's not clear what the behaviour should be if there is a clash, or even what constitutes a clash. Sometimes there is a conflict because the primary key is already present, but it can also be because some other validation constraint is violated. Eg there could be a demand for uniqueness of a field.

In sqlite and others, the way this is managed is with various constraints on the tables, and an ON CONFLICT clause saying how the update should happen.

The general formula for resolving a conflict is to take the existing entry, and the new entry and perform a merge VVV.

table.insert(Blob, on_conflict = 'ignore')

5. Application-level features

5.1. Notifications

A frustratingly absent feature from sqlite is the NOTIFY keyword from Postgresql. If you have multiple processes reading and writing to your sqlite file, it can be a pain to get processes to react to changes in the DB without using polling loops. I'm experimenting with using ZMQ to simulate a NOTIFY.

5.2. Auth

Auth is such a common requirement of apps that I think it makes sense to include some helpers for using OAuth2, OpenID and so on. Ideally it should plug directly into libraries like FastAPI.

5.3. row level security

Another missing feature of sqlite is row-level security (ie the POLICY statements in Postgresql). Given an auth session, I should be able to just write queries over a table and have them automatically filtered to the rows that are able to be accessed by the authenticated user. A big source of security bugs is not including a where user_id == ? check in your queries where you don't have RLS set up. You can implement this in sqlite by using the auth-callback system, but it needs to be wired up to behave as usefully as a POLICY statement.

5.4. Billing / resources

Another common pattern is the ability to manage the resources that a user is consuming.

6. Notes, drafts