A dream ORM [WIP]
Wed Aug 23 2023
E. W. Ayers
Persistent data is a strange topic.
I am creating a sqlite ORM, dxd. It's only half-built, so some of this is still just a wishlist / working doc.
1. Don't hide the SQL
Dxd comes with a full SQL parser and AST manipulation system. You can take SQL expressions and combine them programmatically. These will get transformed into SQL strings that you can inspect directly. At any point you can take a dxd object and see what SQL query it will send to the DB to get results.
Working with the AST brings some benefits that string-only SQL doesn't give you.
1.1. Proper placeholder management
A SQL expression in dxd comes with a sequence of 'placeholder' objects. Where you would write db.execute('SELECT * FROM t WHERE x = ?', (4,)) in raw SQL, dxd compiles the expression to an AST like SelectClause(TableId('t'), where = Eq(Var('x'), Value(4))). So you don't have to do any more 'question-mark counting' in your queries. (Dxd also supports named placeholders.)
Another advantage is that different SQL dialects use different syntax for placeholder substitution, so the AST abstracts that away too.
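The idea in miniature (the class names here are illustrative, not dxd's actual AST):

```python
# Illustrative sketch only: a tiny expression AST that carries its placeholder
# values, so compiling yields the SQL string and the parameter tuple together
# and there is no question-mark counting by hand.
from dataclasses import dataclass

@dataclass
class Value:
    v: object
    def to_sql(self):
        return "?", (self.v,)

@dataclass
class Var:
    name: str

@dataclass
class Eq:
    lhs: Var
    rhs: Value
    def to_sql(self):
        rhs_sql, params = self.rhs.to_sql()
        return f"{self.lhs.name} = {rhs_sql}", params

where = Eq(Var("x"), Value(4))
sql, params = where.to_sql()    # -> ("x = ?", (4,))
# conn.execute(f"SELECT * FROM t WHERE {sql}", params)
```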
1.2. Type checking of queries
SQL doesn't do any type checking of your queries before they run; it just throws errors when you make an illegal move. But I sometimes end up accidentally comparing a string to an int.
In dxd, these SQL AST objects also perform some basic type inference to try to detect silly errors before the query is run and provide more helpful errors.
Also, wherever possible dxd makes sure that all the objects have good Python types. If a column is an integer, your LSP-enabled editor will infer it has a type like Expr[int] (or maybe just int). It's not perfect, but it's helpful to know what data you are throwing around.
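A sketch of how that could look to a type checker (Expr here is a hypothetical generic wrapper, not necessarily dxd's real class):

```python
# Hypothetical sketch: a generic Expr type lets the type checker catch silly
# comparisons (e.g. int column vs str literal) before any SQL is run.
from typing import Generic, TypeVar

T = TypeVar("T")

class Expr(Generic[T]):
    def __eq__(self, other: "T | Expr[T]") -> "Expr[bool]":  # type: ignore[override]
        return Expr()

age: Expr[int] = Expr()      # an integer column shows up as Expr[int]
age == 21                    # fine: builds an Expr[bool]
age == "twenty-one"          # mypy/pyright can flag this before the query runs
```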
1.3. Scoping SQL variables
Additionally, it will do variable scoping checks on the query. For example, in SELECT x from t WHERE y NOT IN (SELECT z from t2 WHERE x + z = y) it can be ambiguous which tables x, y and z are bound to, which is particularly scary if t and t2 share some column names.
This can get really nasty if you are performing joins where the same table is used multiple times.
In the dxd SQL AST expressions, these variables are (tbd) stored as fully-expanded identifiers with proper scoping checks.
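A sketch of what 'fully-expanded' could mean concretely (the node names are illustrative, not dxd's real AST):

```python
# Illustrative only: store every column as a (table, column) pair rather than a
# bare name, so the binding decision is explicit and checkable.
from collections import namedtuple

Column = namedtuple("Column", "table name")
Add = namedtuple("Add", "lhs rhs")
Eq = namedtuple("Eq", "lhs rhs")

# The inner WHERE clause of the example above, assuming t2 has no columns
# called x or y, so both resolve to the outer table t.
inner_where = Eq(Add(Column("t", "x"), Column("t2", "z")), Column("t", "y"))
```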
1.4. Operator overloading
If you have two query expressions, you can just use the normal Python operators +, *, == etc. and it will construct the right SQL-equivalent query expression for you. It will also throw if you try to use the and and or operators, which can't be overloaded.
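A minimal sketch of the mechanism (this Expr class is illustrative, not dxd's real one):

```python
# Sketch of the mechanism: ordinary Python dunder methods build SQL expressions,
# and __bool__ raises so that accidental uses of `and`/`or` fail loudly.
class Expr:
    def __init__(self, sql: str):
        self.sql = sql
    def __add__(self, other: "Expr") -> "Expr":
        return Expr(f"({self.sql} + {other.sql})")
    def __eq__(self, other: "Expr") -> "Expr":  # type: ignore[override]
        return Expr(f"({self.sql} = {other.sql})")
    def __bool__(self):
        raise TypeError("use & and | (or dxd combinators) rather than `and`/`or`")

x, y = Expr("x"), Expr("y")
print(((x + y) == Expr("4")).sql)   # ((x + y) = 4)
# x == y and y == x                 # raises: `and` forces a bool, which calls __bool__
```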
1.5. Prefer lazy query building
Most ORMs just return the whole data object; I would much rather offer a lazily materialized query that you build up.
So for example you might need to retrieve a row of the students table, but each row might have 50 columns that you don't care about; all I care about is the tutor_id field for my join. Most ORMs will return a fully instantiated Student object, but instead it should be something like Query[Student] that can be plugged into a join step. We just provide correctly typed helper functions like student.tutor_id : Query[UUID].
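A sketch of the intended shape (Query, Student and the helper property are stand-ins, not settled API):

```python
# Sketch only: a lazily-built, typed query handle rather than a materialized row.
from dataclasses import dataclass
from typing import Generic, TypeVar
from uuid import UUID

T = TypeVar("T")

@dataclass
class Student:                   # imagine ~50 columns here
    id: UUID
    tutor_id: UUID

@dataclass
class Query(Generic[T]):
    sql: str                     # the query built so far; nothing has hit the db yet

@dataclass
class StudentQuery(Query[Student]):
    @property
    def tutor_id(self) -> Query[UUID]:
        # narrow to one column without materializing a full Student
        return Query(sql=f"SELECT tutor_id FROM ({self.sql})")

students = StudentQuery(sql="SELECT * FROM students")
tutor_ids = students.tutor_id    # Query[UUID]: still lazy, pluggable into a join
```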
2. Data conversion
One of the main pains that ORMs solve is converting values to and from a format that the database can understand. This is also database-dependent: sqlite and postgres handle their datatypes very differently.
2.1. Adapters for atomic data
To solve this, dxd extends the 'adapter pattern', and I aim to support sensible-default conversions for all of the commonly used datatypes like UUID and datetime.
[todo] describe __adapt__ and __conform__ here
Why use adapt and conform? It's not standard (the PEP got rejected), but it is already used by the built-in sqlite3 library, so users should be familiar with it. It's also sufficiently powerful that it can be overridden correctly.
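For atomic types, the stdlib sqlite3 machinery already gives the flavour of what the sensible defaults should do; the snippet below uses sqlite3's real adapter/converter registration, while dxd's own __adapt__/__conform__ hooks remain [todo]:

```python
# The stdlib sqlite3 flavour of the adapter pattern, which dxd builds on:
# register how a UUID goes into the db and how it comes back out.
import sqlite3
import uuid

sqlite3.register_adapter(uuid.UUID, lambda u: u.hex)                       # Python -> SQL
sqlite3.register_converter("uuid", lambda b: uuid.UUID(hex=b.decode()))    # SQL -> Python

conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute("CREATE TABLE t (id uuid PRIMARY KEY)")
conn.execute("INSERT INTO t VALUES (?)", (uuid.uuid4(),))
(row,) = conn.execute("SELECT id FROM t").fetchone()
assert isinstance(row, uuid.UUID)
```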
2.2. Create composite data with 'patterns'
Sometimes you don't want the output of your query to just be a tuple of data. I often find myself writing variations on the following conversion code.
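Something like this hand-rolled unpacking (the table and column names are made up for the example):

```python
# The tedious pattern: manually reshaping flat row-tuples into the nested
# structure I actually want. (conn: an open sqlite3 connection.)
rows = conn.execute(
    "SELECT s.id, s.name, t.id, t.name FROM students s JOIN tutors t ON s.tutor_id = t.id"
)
students = [
    {"id": sid, "name": name, "tutor": {"id": tid, "name": tname}}
    for (sid, name, tid, tname) in rows
]
```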
It's tedious. With dxd, any selector clause in a top-level SELECT statement can be replaced with a 'pattern'.
A pattern can be:
a dictionary whose values are patterns
a list or tuple of patterns
a function sending row-tuples to Python values
Patterns are composable and play well with the placeholder-tracking system.
[todo] examples go here. (patterns are implemented)
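In the meantime, the flavour is roughly this (the API spelling is provisional):

```python
# Provisional sketch: the same reshaping as above, but the nesting is declared
# as a pattern inside the select, so it happens as part of the query itself.
students = students_table.select({
    "id": Student.id,
    "name": Student.name,
    "tutor": {"id": Tutor.id, "name": Tutor.name},
})
```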
3. Schemas
Managing schemas and migrations is always a pain.
3.1. Managing schemas in dxd
I'm a big fan of 'infrastructure as code'. Your persistence stores are a kind of infrastructure, so they should be declared in code. Dxd has a set of features for managing the state of your tables by expressing your schemas and tables declaratively as python objects. Dxd will do its best to make sure that these schemas are kept in sync with the actual tables, indexes, views, triggers in the database itself. The dream is that dxd will do simple migrations for you to keep everything in sync, but get out of the way when there is a more serious migration issue and ensure that older versions of the app will still be able to access the db without fighting over exactly what the schema should be.
3.2. Migration is hard
I have a spicy take on migrations, which is that they just irreducibly suck. Changes to a schema are either conservative or non-conservative. A conservative change is one where all existing consumers of the data won't be able to tell the difference: things like adding a new column (with a default) or adding a new relation.
Conservative migrations are just completely fine and can be automated.
My controversial solution to non-conservative migrations is that you should just not do them (at least with an ORM). If you want to change your schema non-conservatively, you should delete all the tables and start again.
Isn't this really bad? Only in serious production-grade dbs where you need zero downtime. You should have all of your dev/test data in a file that can be loaded into a fresh db, so when the schema changes you can just rebuild the database from that file.
For smaller prod projects, you should have all of your prod data in a backup; ideally this backup is in a WAL-style format like a sequence of events. So I recommend deleting the old version of the database and repopulating it from this backup under the new schema. For the majority of projects this is just fine: accept 20 minutes of downtime.
So the only remaining case to worry about is a hot-swapping zero-downtime production system.
In this case, it's just an irreducible engineering problem where you need to really think about the access patterns and what you can get away with during the transition.
You need a multi-stage plan where you maintain a transient dual-schema table while you switch consumers over to the new schema, and only once everything is switched do you drop the old one.
I think there might be space for some kind of migration tool that can help in situations like this, but it's not going to be in scope for a little Python ORM.
You are just going to have to muck in and write ALTER TABLE commands against the DB directly.
3.3. Schemas in dxd
With that in mind, let's look at the features dxd offers for defining data models.
You can create a schema for a table by just making a Python class that mixes in Schema.
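Something like this is the intent (the import path and field spellings below are my guess, not settled):

```python
# Guess at the spelling: an ordinary dataclass that mixes in Schema.
from dataclasses import dataclass, field
from datetime import datetime
from uuid import UUID, uuid4

from dxd import Schema   # assumed import path

@dataclass
class User(Schema):
    id: UUID = field(default_factory=uuid4)                     # primary key (annotation TBD)
    name: str = ""
    created: datetime = field(default_factory=datetime.utcnow)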
I haven't completely decided on the exact API. But the principles are:
If you already have classes that are dataclasses or Pydantic BaseModels, or come from any other structured data library, you should just be able to add Schema as a mixin and it just works.
There are a few things that you need to annotate: what the foreign keys are, what the primary keys are, how defaults work, autoincrementing primary keys. You can do this in a few different ways and I'll pick whatever is least obstructive and clunky.
Use regular-old Python type annotations as much as possible.
If you need lots of nitty-gritty control over exactly how the CREATE TABLE statement gets constructed, you should just write SQL directly.
The main thing that the Schema mixin gives you is the create_table('my_table') function that makes the persistent table. Note that this means you can have multiple tables with the same schema. If you skip the name it will be given a default name (eg Users_table).
The system will warn you if there is already a table with the given name but with an incompatible schema (ie one schema is not a conservative migration of the other).
You can specify details like primary keys and constraints either on the Schema or in the create_table function.
Additionally, in the create_table function you need to specify which tables the foreign keys in this table refer to. Dxd will do some checks to make sure that the foreign keys are valid.
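Putting the last few points together, creation might look like this (the argument names are provisional):

```python
# Provisional sketch: two tables from schemas, with the foreign-key target
# named at creation time so dxd can check that the reference is valid.
users = User.create_table("users")
blobs = Blob.create_table("blobs", references={"user_id": users})
```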
You can also do schema-intersections A & B: all the fields from both A and B must be present.
3.4. Join fields
Often you want one column in a table to refer to an entry in another table. The standard way to do this is by storing foreign keys.
But sometimes when you are rapid-prototyping you don't really want to faff around and would rather have something closer to Python's data model. Dxd offers a few ways of expressing these kinds of relations that go beyond foreign keys (but under the hood they all resolve to foreign keys).
These column types abstract over the various ways that you might need to access data that would typically be done through joins in SQL.
x : Subtable[S] where S <: Schema creates an auxiliary table whose primary key is the concatenation of the parent table's primary key and S's primary key.
Reference[S] is a foreign key reference to another table in the database; when the table is created, you need to pass the name of the referenced table.
JoinDict[K,V] is a subclass of dict. When you make a table where one of the columns has the type JoinDict, an auxiliary table is created whose primary key is the primary key of the parent table together with K.
JoinList[V] is a list of items stored on an auxiliary table, similar to JoinDict (except the key is now an integer, with a contiguity constraint ensuring that all entries exist within a certain range).
Inline[S] will put the fields of S as columns on the parent table. It should behave the same as Subtable but without the need to perform a join.
The dream is you can also write custom composite join-field types and provide the machinery for creating and looking these up. When you create queries with these statements, they will be compiled to SQL queries (not run in python). The point is that although you have all these wacky abstractions, at the end of the day they all cash out as SQL that you can just look at. All of these features are optional and you can always fall back to using the SQL directly.
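To make that concrete, a schema using these column types might be written like this (purely illustrative; none of these spellings are final):

```python
# Illustrative only: one schema mixing the join-field column types above.
from dataclasses import dataclass
from uuid import UUID

from dxd import Schema, Reference, Inline, JoinDict, JoinList   # assumed imports

@dataclass
class Course(Schema):
    id: UUID
    tutor: Reference["User"]      # foreign key; the target table is named at create_table time
    meta: Inline["CourseMeta"]    # CourseMeta's fields become columns on this table
    grades: JoinDict[UUID, int]   # auxiliary table keyed by (course primary key, UUID)
    lectures: JoinList[str]       # auxiliary table with a contiguous integer key
```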
3.5. Standalone persistent objects
Sometimes you just want a dictionary that is persistent. Dxd comes with a set of persistent out-of-the-box datastructures that you can use. If you are familiar with the DiskCache library you will get the idea.
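For example (PDict is a placeholder name for whatever dxd ends up calling its persistent dict):

```python
from dxd import PDict     # placeholder name

d = PDict("my-pdict")     # backed by a sqlite table
d["greeting"] = "hello"   # written straight through to disk
assert d["greeting"] == "hello"
```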
This code will create a sqlite table named 'my-pdict' that behaves like a dictionary, but any changes will be instantly persisted to disk.
I think the exact location that the table is stored on disk should be configurable with a ContextVar, but the default is some sensible app-storage area.
There are a few other structures you can use:
priority queue
set
list
append-only event stream
Each persistent datastructure also comes with transaction primitives, so you can start a transaction to atomically make changes to the structure.
Some atomic operations are built in, like pdict.modify(f : V → V).
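The shape I have in mind is roughly this (the method names, and the exact signature of modify, are provisional):

```python
# Provisional sketch: group several writes into one atomic transaction, or use
# a built-in atomic operation directly.
with d.transaction():
    d["count"] = d.get("count", 0) + 1
    d["last_event"] = "checkout"

d.modify("count", lambda v: v + 1)   # atomic read-modify-write with f : V -> V
```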
3.6. More complicated persistent objects
I also want to offer some more advanced persistence systems. Something I'm really excited about working on is having versioning and merging primitives that support conflict resolution.
For example, there could be a PersistentTextDoc type which is just a string, or maybe a structured rich-text tree.
Multiple users could modify this locally, send back to server and then the persistence layer on the server could receive two conflicting versions.
Both of these would be persisted and a conflict resolution merge would be provided.
How a merge should occur is usually an app-specific choice, so giving the users the power to create their own persistent objects and composing them out of simpler ones seems the right way to go here.
Any of these kinds of objects should also be allowed to live as a column in a schema.
4. Queries
The current plan is that queries behave the same as in pandas and similar dataframe-manipulation libraries like Polars.
Unlike pandas, the queries are lazy; at any point you can run q.to_sql and see the generated SQL. They only get sent to the db when you do something like call iter on them.
There is a cool library called pandas-to-sql that this is related to.
But also, all this stuff is optional and you can just use normal SQL queries too and mix and match.
All of the query syntax I introduce here gets transparently compiled to SQL that you can read and modify.
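For instance, given a table of Blob rows, the simplest query might be spelled (provisionally) like this:

```python
q = blobs.select()     # a lazy query over the whole table; nothing has run yet
print(q.to_sql())      # inspect the SQL it will send
```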
This returns an iterable of Blob objects. The query will only be executed once iter is called on it. But you can also select particular columns.
We also provide map and select operators. The examples below return the same thing, but map will be run locally, whereas select will wrap it as a query. The object in the select statement is called a 'pattern'.
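For example (continuing the hypothetical blobs table; the spellings are provisional):

```python
# map: the function runs locally, in Python, on rows as they arrive.
pairs = blobs.map(lambda b: (b.user_id, b.digest))

# select: the tuple is a pattern, so the projection is compiled into the SQL.
pairs = blobs.select((Blob.user_id, Blob.digest))
```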
As a shorthand, you can also write table[filter][(Blob.user_id, Blob.digest)] similarly to pandas. But you should prefer select because then we can give correct type hinting.
loc, head, tail, nlargest, groupby and agg are also available and work the same as in pandas.
Joins are done using the merge operator. Filtering joins (keeping the rows of one table based on matches in another, without pulling in the other table's columns) are also supported.
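Roughly, with a pandas-like surface (the keyword and method spellings here are my guess):

```python
# A pandas-flavoured join of blobs against their owning users.
joined = blobs.merge(users, left_on=Blob.user_id, right_on=User.id)

# A filtering join: keep only blobs whose user still exists, without adding
# any of the user columns to the result.
owned = blobs[Blob.user_id.isin(users.select(User.id))]
```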
4.1. Crazy joins with product tables
But I never liked this approach, so you can also do joins with 'multi-table predicates'. These are predicates that map multiple table indices to bools, and they can be used to create joins.
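Concretely, the positional form might look like this (dxd.product matches the named form mentioned below; the predicate-application spelling is my guess):

```python
# A product table over two tables, then a multi-table predicate picking out
# the matching pairs.
bxu = dxd.product(blobs, users)
y = bxu[lambda blob, user: blob.user_id == user.id]
```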
Iterating bxu would do the same as itertools.product(blobs, users). Iterating y returns pairs of blobs and users with matching user ids. But dxd compiles y to a query, so what actually runs is a join on the database. You can also write dxd.product({"blobs": blobs, "users": users}) to instead have a named product table; then you would write y['blobs'] instead of y[0].
The idea of this approach is that you no longer need to clunkily distinguish between left, right, inner, outer joins.
You just choose a predicate over the product table, and you iterate or select on the resulting set.
4.2. Mutating the data
You can mutate things the same way you would in pandas.
We provide update and insert operators.
Inserting is tricky because it's not clear what the behaviour should be if there is a clash, or even what constitutes a clash. Sometimes there is a conflict because the primary key is already present, but it can also be because some other validation constraint is violated. Eg there could be a demand for uniqueness of a field.
In sqlite and others, the way this is managed is with various constraints on the tables, and an ON CONFLICT clause saying how the update should happen.
The general formula for resolving a conflict is to take the existing entry and the new entry and perform a merge V → V → V.
ignore := | a _ => a
replace := | _ b => b
-- also known as 'upsert'
fail := | a b => raise
update := | a b => f a b
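In dxd this might surface as an argument to insert (the spelling is provisional, and merge_rows is a made-up helper):

```python
# Provisional: each conflict policy is just a merge of (existing, new) -> row.
users.insert(new_user, on_conflict="ignore")     # keep the existing row
users.insert(new_user, on_conflict="replace")    # take the new row: 'upsert'
users.insert(new_user, on_conflict="fail")       # raise on any clash
users.insert(new_user, on_conflict=lambda old, new: merge_rows(old, new))
```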
5. Application-level features
5.1. Notifications
A frustratingly absent feature of sqlite is the NOTIFY keyword from Postgresql. If you have multiple processes reading and writing to your sqlite file, it can be a pain to get processes to react to changes in the DB without using polling loops.
I'm experimenting with using ZMQ to simulate a NOTIFY.
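The shape of the experiment is roughly this, using plain pyzmq and sqlite3 (nothing dxd-specific yet; in reality the two halves would be separate processes):

```python
# Minimal experiment: a ZMQ pub/sub socket as a stand-in for Postgres NOTIFY.
import sqlite3
import time
import zmq

ctx = zmq.Context.instance()

# Listening side: subscribe to change notifications instead of polling the db.
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://127.0.0.1:5556")
sub.setsockopt_string(zmq.SUBSCRIBE, "")

# Writing side: commit to sqlite, then publish which table changed.
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://127.0.0.1:5556")
time.sleep(0.2)                      # let the subscriber finish connecting
conn = sqlite3.connect("app.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (kind TEXT)")
with conn:
    conn.execute("INSERT INTO events (kind) VALUES (?)", ("ping",))
pub.send_string("events")            # poor man's NOTIFY: name the changed table

print(sub.recv_string())             # the listener wakes up with "events"
```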
5.2. Auth
Auth is such a common requirement of apps that I think it makes sense to include some helpers for using OAuth2, OpenID and so on. Ideally it should plug directly into libraries like FastAPI.
Plugs right in with OAuth2: you just put your client ids and secrets in a config file and then it gives you the standard endpoints.
User management out of the box. You can 'sign in' using the usual OAuth2 methods.
User identity resolution: most providers like Google and GitHub expose extra data such as a URL to a profile avatar, known email addresses, phone numbers and addresses. It should support this kind of identity resolution out of the box.
API-key management: methods for working with JWT jti-based api keys.
Eventually, some support for roles, groups, permissions etc.
5.3. Row-level security
Another missing feature of sqlite is row-level security (ie the POLICY statements in Postgresql).
Given an auth session, I should be able to just write queries over a table and have them automatically filtered to the rows that are able to be accessed by the authenticated user.
A big source of security bugs is forgetting a where user_id == ? check in your queries when you don't have RLS set up.
You can implement this in sqlite by using the auth-callback system, but it needs to be wired up to behave as usefully as a POLICY statement.
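sqlite3's authorizer callback is the real hook here; a minimal sketch of gating reads by the current session (the row-level filtering itself would still have to come from query rewriting on top):

```python
# Minimal sketch: sqlite3's set_authorizer can deny or null-out reads of
# protected tables/columns per connection. It is per-column, not per-row,
# so dxd would still need to rewrite queries to add the user_id filter.
import sqlite3

conn = sqlite3.connect("app.db")
current_user_is_admin = False        # would come from the auth session

def authorizer(action, arg1, arg2, dbname, trigger):
    # For SQLITE_READ, arg1 is the table name and arg2 the column name.
    if action == sqlite3.SQLITE_READ and arg1 == "billing" and not current_user_is_admin:
        return sqlite3.SQLITE_IGNORE   # reads of billing columns come back NULL
    return sqlite3.SQLITE_OK

conn.set_authorizer(authorizer)
```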
5.4. Billing / resources
Another common pattern is the ability to manage the resources that a user is consuming.
Users can have credit balances.
App code can be gated by subtracting a certain amount from a user's balance. All of these kinds of transactions are kept in an auditing table. I see this working as a python context manager or decorator, where you only get charged if the code exits the with block without erroring (see the sketch after this list).
Free quotas.
Common budgeting tools for users, like setting limits, monthly limits etc.
Similar for storage billing, eg you can take each row in a table and charge an amount per unit of time that the row exists.
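The gating idea, sketched (every name here is invented for illustration):

```python
# Invented names throughout: charge a user only if the gated block succeeds.
from contextlib import contextmanager

@contextmanager
def charge(user_id, amount):
    if get_balance(user_id) < amount:          # check the quota/balance up front
        raise InsufficientCredit(user_id)
    yield                                      # run the gated app code
    debit_and_audit(user_id, amount)           # debit + audit row, only on success

with charge(user_id, amount=5):
    run_expensive_model()
```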