A Tool for Producing Verified, Explainable Proofs.
Edward William Ayers
Corpus Christi College
University of Cambridge
Submission Date: 2021-09-06

This thesis is submitted for the degree of Doctor of Philosophy.

This thesis is the result of my own work and includes nothing which is the outcome of work done in collaboration except as declared in the preface and specified in the text. It is not substantially the same as any work that has already been submitted before for any degree or other qualification except as declared in the preface and specified in the text. It does not exceed the prescribed word limit for the Mathematics Degree Committee.


Mathematicians are reluctant to use interactive theorem provers. In this thesis I argue that this is because proof assistants don't emphasise explanations of proofs; and that in order to produce good explanations, the system must create proofs in a manner that mimics how humans would create proofs. My research goals are to determine what constitutes a human-like proof and to represent human-like reasoning within an interactive theorem prover to create formalised, understandable proofs. Another goal is to produce a framework to visualise the goal states of this system.

To demonstrate this, I present HumanProof: a piece of software built for the Lean 3 theorem prover. It is used for interactively creating proofs that resemble how human mathematicians reason. The system provides a visual, hierarchical representation of the goal and a system for suggesting available inference rules. The system produces output in the form of both natural language and formal proof terms which are checked by Lean's kernel. This is made possible by a structured goal state system, detailed in Chapter 3, which interfaces with Lean's tactic system.

In Chapter 4, I present the subtasks automation planning subsystem, which is used to produce equality proofs in a human-like fashion. The basic strategy of the subtasks system is to break a given equality problem into a hierarchy of tasks and then maintain a stack of these tasks in order to determine the order in which to apply equational rewriting moves. This process produces equality chains for simple problems without having to resort to brute force or specialised procedures such as normalisation. This makes proofs more human-like by breaking the problem into a hierarchical set of tasks in the same way that a human would.

To produce the interface for this software, I also created the ProofWidgets system for Lean 3. This system is detailed in Chapter 5. The ProofWidgets system uses Lean's metaprogramming framework to allow users to write their own interactive, web-based user interfaces to display within the VSCode editor and in an online web-editor. The entire tactic state is available to the rendering engine, and hence expression structure and types of subexpressions can be explored interactively. The ProofWidgets system also allows the user interface to interactively edit the proof document, enabling a truly interactive modality for creating proofs; human-like or not.

In Chapter 6, the system is evaluated by asking real mathematicians about the output of the system, and what it means for a proof to be understandable to them. The user group study asks participants to rank and comment on proofs created by HumanProof alongside natural language and pure Lean proofs. The study finds that participants generally prefer the HumanProof format over the Lean format. The verbal responses collected during the study indicate that providing intuition and signposting are the most important properties of a proof that aid understanding.


Chapter 1

My first contact with the ideas of formalised mathematics came from reading the anonymously authored QED Manifesto [Ano94], which envisions a 'QED system' in which all mathematical knowledge is stored in a single, computer-verified repository. This idea dizzied me: perhaps review of mathematics will amount to remarking on style and interest, with checking of proofs performed automatically from a machine-readable document. (In this thesis, shortened citation references will appear in the sidebar; a full bibliography with all reference details is provided at the end of the document. Some sidebar citations will be omitted if there is not enough space.)

The general term that I will use for software that works towards this vision is proof assistant or Interactive Theorem Prover (ITP). A proof assistant at its most general is a piece of software that allows users to create and verify mathematical proofs. In Section 2.1 I will provide more detail on how proof assistants are generally constructed.

In 2007, Freek Wiedijk [Wie07] pronounced the QED project to have "not been a success (yet)", citing not enough people working on formalised mathematics and the severe differences between formalised and 'real' mathematics, both at a syntactic level (formalised mathematics resembles source code) and at a foundational level (formalised mathematics is usually constructive and procedural as opposed to classical and declarative). Similarly, Alan Bundy [Bun11] notes that although mathematicians have readily adopted computational tools such as TeX [Knu86] and computer algebra systems, computer aided proving has had very little impact on the workflow of a working mathematician. (A computer algebra system (CAS) is a tool for symbolic manipulation of formulae and expressions, without necessarily having a formalised proof that the manipulation is sound. Examples of CASes include Maple and Mathematica.) Bundy cites several reasons for this which will be discussed in Section 1.1.

Now, a decade later, the tide may be turning. In 2021, proof assistants are pretty good. There are several well-supported large-scale systems such as Isabelle [Pau89], Coq [Coq], Lean [MKA+15], HOL Light [Har09], Agda [Nor08], Mizar [GKN15], PVS [SORS01] and many more. These systems are used to define and prove mathematical facts in a variety of logics (e.g. FOL, HOL, CIC, univalent foundations). These systems are bridged to powerful, automated reasoning systems (e.g. Vampire [RV02], Z3 [MB08], E [SCV19] and Leo-III [SB18a]). Within these systems, many theorems big and small (4-colour theorem [Gon08], Feit-Thompson theorem [GAA+13], Kepler conjecture [HAB+17]) have been proved in a variety of fields, accompanied by large mathematical libraries (Isabelle's Archive of Formal Proofs, Lean's mathlib, Coq's Mathematical Components, Mizar's Formalized Mathematics) whose intersection with undergraduate and research level mathematics is steadily growing (see, for example, the rate of growth of the Lean 3 mathematical library: https://leanprover-community.github.io/mathlib_stats.html).

However, in spite of these advances, we are still yet to see widespread adoption of ITP by mathematicians outside of some (growing) cliques of enthusiasts. In this thesis I wish to address this problem through engaging with how mathematicians use and understand proofs to create new ways of interacting with formalised proof. Let's first expand on the problem a little more and then use this to frame the research questions that I will tackle for the remainder of the thesis.

1.1. Mathematicians and proof assistants

Here I offer three possible explanations for why mathematicians have not adopted proof assistants. Many have commented on these before: Bundy [Bun11] summarises the main challenges well.

1. Differing attitudes towards correctness and errors. Mathematicians don't worry about mistakes in the same way as proof assistants do (I will present some evidence for this in Section 2.5). Mathematicians care deeply about correctness, but historically the dynamics determining whether a result is considered to be true are also driven by sociological mechanisms such as peer-review; informal correspondences; 'folk' lemmas and principles; reputation of authors; and so on [MUP79]. A proxy for trustworthiness of a result is the number of other mathematicians that have scrutinized the work. That is, if the proof is found on an undergraduate curriculum, you can predict with a high degree of confidence that any errors in the proof will be brought to the lecturer's attention. In contrast, a standalone paper that has not yet been used for any subsequent work by others is typically treated with some degree of caution.

2. High cost. Becoming proficient in an ITP system such as Isabelle or Coq can require a lot of time. Formalising an area of maths can then take around ten times the amount of time required to write a corresponding paper or textbook on the topic. This time quickly balloons if it is also necessary to write any underlying assumed knowledge of the topic (e.g., measure theory first requires real analysis). This 'loss factor' of the space cost of developing a formalised proof over that of a natural language proof was first noted by de Bruijn in relation to his AUTOMATH prover [DeB80]. De Bruijn estimates a factor of 20 for AUTOMATH, and Wiedijk later estimates this factor to be closer to three or four in Mizar [Wie00]. There are costs other than space too; the main ones of concern here are the time to learn to use the tools and the amount of work required per proof.

3. Low reward. What does a mathematician have to gain from formalising their research? In many cases, there is little to gain other than confirming something the researcher knew to be true anyway. The process of formalisation may bring to light 'bugs' in the work: perhaps there is a trivial case that wasn't accounted for or an assumption needs to be strengthened. Sometimes the reward is high enough that there is a clear case for formalisation, particularly when the proof involves some computer-generated component. This is exemplified by Hales' proof [Hal05] and later formalised proof [HAB+17] of the Kepler conjecture. The original proof involved lengthy computer generated steps that were difficult for humans to check, and so Hales led the Flyspeck project to formalise it, taking 21 collaborators around a decade to complete. Another celebrated example is Gonthier's formalisation of the computer-generated proof of the four-colour theorem [Gon08]. Formalisation is also used regularly in verifying expensive hardware and safety-critical computer software (e.g., [KEH+09, Pau98]).

The economics of the matter are such that the gains of using ITP are too low compared to the costs for the majority of cases. Indeed, since mathematicians have a different attitude to correctness, there are sometimes no benefits to formalisation. As ITP developers, we can improve the situation by either decreasing the learning cost or increasing the utility.

How can we make ITP easier to learn? One way is to teach it in undergraduate mathematics curricula (not just computer science). An example of such a course is Massot's Introduction aux mathématiques formalisées taught at the Université Paris Sud. Another way is to improve the usability of the user interface for the proof assistant; I will consider this point in more detail in Chapter 5.

How can we increase the utility that mathematicians gain from using a proof assistant? In this thesis I will argue that one way to help with these three issues is to put more emphasis on interactive theorem provers providing explanations rather than a mere guarantee of correctness. We can see that explanations are important because mathematicians care about new proofs of old results that treat the problem in a new way. Proofs from the Book [AZHE10] catalogues some particularly lovely examples of this.

Can computers also provide informal proofs with more emphasis on explanations? Gowers [Gow10 §2] presents an imagined interaction between a mathematician and a proof assistant of the future.

Quotation 1.1

Excerpt from an imagined conversation between a mathematician and a computer from [Gow10 §2].

Mathematician. Is the following true? Let δ > 0. Then for N sufficiently large, every set A ⊂ {1, 2, ..., N} of size at least δN contains a subset of the form {a, a + d, a + 2d}?

Computer. Yes. If A is non-empty, choose a ∈ A and set d = 0.

M. All right all right, but what if d is not allowed to be zero?

C. Have you tried induction on N, with some δ = δ(N) tending to zero?

M. That idea is no help at all. Give me some examples please.

C. The obvious greedy algorithm gives the set {1, 2, 4, 5, 10, 11, 13, 14, 28, 29, 31, 32, 37, 38, 40, 41, ...}

An interesting feature of this conversation is that the status of the formal correctness of any of the statements conveyed by the computer is not mentioned. Similar notions are brought to light in the work of Corneli et al. [CMM+17] in their modelling of informal mathematical dialogues and exposition.

Why not have both explanatory and verified proofs? I suspect that if an ITP system is to be broadly adopted by mathematicians, it must concisely express theorems and their proofs in a way similar to that which a mathematician would communicate with fellow mathematicians. This not only requires constructing human-readable explanations, but also a reimagining of how the user can interact with the prover.

In this thesis, I will focus on problems that are considered 'routine' for a mathematician. That is, problems that a mathematician would typically do 'on autopilot' or by 'following their nose' (for example, deriving an identity from the ring axioms). I choose to focus on this class of problem because I believe it is an area where ITP could produce proofs that explain why they are true rather than merely provide a certificate of correctness. The typical workflow when faced with a problem like this is to either meticulously provide a low-level proof or apply automation such as Isabelle's auto, or an automation orchestration tool such as Isabelle's Sledgehammer [BN10]. In the case of using an automation tactic like auto, the tactic will either fail or succeed, leaving the user with little feedback on why the result is true. (Broadly, a tactic is a program for creating proofs. I will drill down on how this works in Chapter 2.) There are some tools for producing intelligible proofs from formalised ones, for example, the creation of Isar [Wen99] proofs from Sledgehammer by Blanchette et al. [BBF+16]. However, gaining an intuition for a proof will be easier if the proof is generated in a way that reflects how a human would solve the problem, and so translating a machine proof to a proof which a human will extract meaning from is an uphill battle.

1.1.1. Types of understandability

The primary motivation of the work in this thesis is to help make ITP systems more appealing to mathematicians. The approach I have chosen to take towards this is to research ways of making ITP systems more understandable. There are many components of ITP that I consider with respect to understandability:

The different parts of my thesis will address different sets of these ways in which a proof assistant can be understandable. With respect to the automation and underlying-representation aspects of understandability, we will see in Section 2.6 that there is some debate over whether prover automation needs to be easy to follow for a human or not (machine-like vs. human-like). In this thesis I take a pragmatic stance that the understandability of automation and underlying-representation need not be human-like provided that the resulting interaction and output is understandable. However, as I investigate in Chapter 4, there may be ways of creating automation that are more conducive to creating understandable output and interaction.

1.2. Research questions

In the context of these facets of an understandable ITP system, there arise some key research questions that I seek to study.

Question 1. What constitutes a human-like, understandable proof?


Question 2. How can human-like reasoning be represented within an interactive theorem prover to produce formalised, understandable proofs?


Question 3. How can this mode of human-like reasoning be presented to the user in an interactive, multimodal way?


1.3. Contributions

This thesis presents a number of contributions towards the above research questions:

  1. An abstract calculus for developing human-like proofs (Chapter 3).

  2. An interface between this abstraction layer and a metavariable-driven tactic state, as is used in theorem provers such as Coq and Lean, producing formally verified proofs (Chapter 3 and Appendix A).

  3. A procedure for generating natural language proofs from this calculus (Section 3.6).

  4. The 'subtasks' algorithm, a system for automating the creation of chains of equalities and inequalities (Chapter 4). This work has been published in [AGJ19].

  5. A graphical user interface framework for interactive theorem provers (Chapter 5). This has been published in [AJG21].

  6. An implementation of all of the above contributions in the Lean 3 theorem prover.

  7. A study assessing the impact of natural language proofs with practising mathematicians (Chapter 6).

The implementations for these contributions can be found at the following links:

1.4. Structure of this document

In Chapter 2, I will provide an overview of the background material needed for the remaining chapters. Next, in Chapter 3, I introduce the HumanProof software for producing human-like proofs within the Lean proof assistant. I provide motivation of the design in Section 3.1, an overview of the system in Section 3.2 and then dive into the details of how the system is designed, including the natural-language generation engine in Section 3.6. Chapter 4 discusses a system for producing equational reasoning proofs called the subtasks algorithm. Chapter 5 details the ProofWidgets system, which is used to produce the user interface of HumanProof. Chapter 6 provides the design and results of a user study that I conducted on mathematicians to determine whether HumanProof really does provide understandable proofs. Finally, Chapter 7 wraps things up with some reflection on my progress and a look ahead to future work.

There are also four appendices:

1.5. Previously published work and collaboration

The work in Chapter 3 is my own, although the box calculus presented is inspired by many sessions of discussion with W. T. Gowers and by the design of Gowers' previous collaboration with Ganesalingam [GG17]. More on this will be given when it is surveyed in Section 2.6 and Section 3.3.5.

The work in Chapter 4 was previously published at KI 2019 [AGJ19].

The work presented in Chapter 5 is pending publication in ITP 2021 [AJG21] and is also merged into the Lean 3 community repository. The design is strongly influenced by Elm and React; however, there are a number of novel architectural contributions necessitated by the unique challenges of implementing a portable framework within a proof assistant.

The user study presented in Chapter 6 is all my own work with a lot of advisory help from Mateja Jamnik, Gem Stapleton and Aaron Stockdill on designing the study.

1.6. Acknowledgements

I thank my supervisors W. T. Gowers and Mateja Jamnik for their ideas, encouragement and support and for letting me work on such a wacky topic. I thank Gabriel Ebner and Brian Gin-ge Chen for reading through my ProofWidgets PRs. I thank Patrick Massot, Kevin Buzzard and the rest of the Lean Prover community for complaining about my PRs after the fact. I thank Jeremy Avigad for taking the time to introduce me to Lean at the Big Proof conference back in 2017. I thank Bohua Zhan, Chris Sangwin, and Makarius Wenzel and many more for the enlightening conversations on automation for mathematicians at Big Proof and beyond. I thank Peter Koepke for being so generous in inviting me to Bonn to investigate Naproche/SAD with Steffan Frerix and Andrei Paskevich. I thank Larry Paulson and the ALEXANDRIA team for letting me crash their weekly meetings. I thank my parents for letting me write up in the house during lockdown.

I thank my friends and colleagues in the CMS: Andrew, Eric, Sammy P, Sven, Ferdia, Mithuna, Kasia, Sam O-T, Bhavik, Wojciech, and many more. Likewise in the Computer Laboratory: Chaitanya, Botty, Duo, Daniel, Aaron, Angeliki, Yiannos, Wenda, Zoreh.

This research was supported by EPSRC and the Cantab Capital Institute for the Mathematics of Information.

1.6.1. Typesetting acknowledgements

I decided to typeset this thesis as HTML-first, print second. The digital copy may be found at https://edayers.com/thesis. The printed version of this thesis was generated by printing out the website version and concatenating.

I was able to create the thesis in this way thanks to many open-source projects. I will acknowledge the main ones here: React, Gatsby, Tachyons and PrismJS. Thanks to Titus Wormer for remarkJS and also for adding my feature request in less than 24 hours! The code font is PragmataPro created by Fabrizio Schiavi. The style of the site is a modified version of the Edward Tufte Handout style. The syntax colouring style is based on the VS theme by Andrew Lock. I also use some of the vscode-icons icons.

Chapter 2

In this chapter I will provide a variety of background material that will be used in later chapters. Later chapters will include links to the relevant sections of this chapter. I cover a variety of topics:

2.1. The architecture of proof assistants

In this section I am going to provide an overview of the designs of proof assistants for non-specialists. The viewpoint I present here is largely influenced by the one that Andrej Bauer expresses in a MathOverflow answer [Bau20].

The essential purpose of a proof assistant is to represent mathematical theorems, definitions and proofs in a language that can be robustly checked by a computer. This language is called the foundation language, and it comes equipped with a set of derivation rules. The language defines the set of objects that formally represent mathematical statements and proofs, and the inference rules and axioms provide the valid ways in which these objects can be manipulated. (At this point, we may raise a number of philosophical objections such as whether the structures and derivations 'really' represent mathematical reasoning. The reader may enjoy the account given in the first chapter of Logic for Mathematicians by J. Barkley Rosser [Ros53].) Some examples of foundations are first-order logic (FOL), higher-order logic (HOL), and various forms of dependent type theory (DTT) [Mar84, CH88, PP89, Pro13].

A component of the software called the kernel checks proofs in the foundation. There are numerous foundations and kernel designs. Finding new foundations for mathematics is an open research area but FOL, HOL and DTT mentioned above are the most well-established for performing mathematics. I will categorise kernels as being either 'checkers' or 'builders'.

A 'checker' kernel takes as input a proof expression and outputs a yes/no answer to whether the term is a valid proof. An example of this is the Lean 3 kernel [MKA+15].

A 'builder' kernel provides a fixed set of partial functions that can be used to build proofs. Anything that this set of functions accepts is considered as valid. This is called an LCF architecture, originated by Milner [Mil72, Gor00]. The most widely used 'builder' is the Isabelle kernel by Paulson [Pau89].

Most kernels stick to a single foundation or family of foundations. The exception is Isabelle, which instead provides a 'meta-foundation' for defining foundations; however, the majority of work in Isabelle uses the HOL foundation.

2.1.1. The need for a vernacular

One typically wants the kernel to be as simple as possible, because any bugs in the kernel may result in 'proving' a false statement. (An alternative approach is to 'bootstrap' increasingly complex kernels from simpler ones. An example of this is the Milawa theorem prover for ACL2 [Dav09].) For the same reason, the foundation language should also be as simple as possible. However, there is a trade-off between kernel simplicity and the usability and readability of the foundation language; a simplified foundation language will lack many convenient language features such as implicit arguments and pattern matching, and as a result will be more verbose. If the machine-verified definitions and lemmas are tedious to read and write, then the prover will not be adopted by users.

Proof assistant designers need to bridge this gap between a human-readable, human-understandable proof and a machine-readable, machine-checkable proof. A common approach is to use a second language called the vernacular (shown in Figure 2.5). The vernacular is designed as a human-and-machine-readable compromise that is converted to the foundation language through a process called elaboration (e.g., [MAKR15]). The vernacular typically includes a variety of essential features such as implicit arguments and some form of type inference, as well as high-level programming features such as pattern matching. Optionally, there may be a compiler (see Figure 2.5) for the vernacular to also produce runnable code; for example, Lean 3 can compile vernacular to bytecode [EUR+17].
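As a small illustration of these vernacular features, the following Lean 3 sketch defines a function using an implicit argument and pattern matching, evaluates it, and then asks the pretty-printer to show the argument that the vernacular allowed us to omit. The name my_length is hypothetical and chosen only for this example.

```lean
-- Vernacular definition: `α` is an implicit argument, and the function
-- is defined by pattern matching on the shape of the list.
def my_length {α : Type} : list α → ℕ
| []       := 0
| (_ :: t) := my_length t + 1

-- Type inference fills in `α := ℕ`; `#eval` runs the compiled definition
-- in Lean 3's virtual machine.
#eval my_length [1, 2, 3]

-- Show the implicit argument that elaboration inserted.
set_option pp.implicit true
#check my_length [1, 2, 3]
```

With pp.implicit enabled, the last command displays the application with the type argument written out explicitly, which gives a glimpse of the more verbose form that elaboration produces.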

I discuss some work on provers with the vernacular being a restricted form of natural language as one might find in a textbook in Section 2.7.2.

2.1.2. Programs for proving

Using a kernel for checking proofs and a vernacular structure for expressing theorems, we now need to be able to construct proofs of these theorems.

An Automated Theorem Prover (ATP) is a piece of software that produces proofs for a formal theorem statement automatically, with a minimal amount of user input as to how to find the proof; examples include Z3, E and Vampire.

Interactive Theorem Proving (ITP) is the process of creating proofs incrementally through user interaction with a prover. I will provide a review of user interfaces for ITP in Section 5.1. Most proof assistants incorporate various automated and interactive theorem proving components. Examples of ITPs include Isabelle [Pau89], Coq [Coq], Lean [MKA+15], HOL Light [Har09], Agda [Nor08], Mizar [GKN15], PVS [SORS01].

Figure 2.1

An example proof script from the Lean 3 theorem prover. The script proper consists of the lines between the begin and end keywords. Each line in the proof script corresponds to a tactic.

A common modality for allowing the user to interactively construct proofs is the proof script (Figure 2.1): a sequence of textual commands, written by the user, that invoke certain proving programs called tactics, which manipulate a state representing a partially constructed proof. An example of a tactic is the assume tactic in Figure 2.1, which converts a goal state of the form ⊢ X → Y to X ⊢ Y. Some of these tactics may invoke various ATPs to assist in constructing proofs. Proof scripts may be purely linear as in Figure 2.1 or have a hierarchical structure such as in Isar [Wen99] or HiProof [ADL10].
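To make the effect of tactics on the goal state concrete, here is a minimal Lean 3 sketch of a linear proof script. It is an illustration written for this passage rather than a reproduction of Figure 2.1; the theorem name first_of_two and the hypothesis names hp and hq are hypothetical.

```lean
theorem first_of_two (P Q : Prop) : P → Q → P :=
begin
  assume hp,   -- goal changes from ⊢ P → Q → P to hp : P ⊢ Q → P
  assume hq,   -- goal changes to hp : P, hq : Q ⊢ P
  exact hp,    -- close the goal using the hypothesis hp
end

-- Display the proof term that the script constructed and the kernel checked.
#print first_of_two
```

Each assume moves the antecedent of the implication into the list of hypotheses, and the final #print shows that the script is only a means of building a proof term.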

An alternative to a proof script is for the prover to generate an auxiliary proof object file that holds a representation of the proof that is not intended to be human readable. This is the approach taken by PVS [SORS01]. I will not investigate this approach further in this thesis because most ITP systems use the proof-script approach.

In the process of proving a statement, a prover must keep track of partially built proofs. I will refer to these representations of partially built proofs as development calculi. I will return to development calculi in Section 2.4.

2.1.3. Foundation

A foundation for a prover is built from the following pieces:

  1. A language: defining inductive trees of data that we wish to talk about and also syntax for these trees.

  2. The judgements: meta-level predicates over the above trees.

  3. The inference rules: a generative set of rules for deriving judgements from other judgements.

To illustrate, the language of simply typed lambda calculus would be expressed as in (2.2).


Example of a BNF grammar specification. A and X are some sets of variables (usually strings of unicode letters).

𝑥, 𝑦, 𝑧 ::= X -- variable
α, β ::= A | α → β -- type
𝑠, 𝑡 ::= 𝑠 𝑡 | λ (𝑥 : α), 𝑠 | X -- term
Γ ::= ∅ | Γ, (𝑥 : α) -- context

In (2.2), the purple Greek and italicised letters (𝑥, 𝑦, α, ...) are called nonterminals. They say: "You can replace me with any of the |-separated items on the right-hand-side of my ::=". So, for example, "α" can be replaced with either a member of A or "α → β". The green words in the final column give a natural language noun to describe the 'kind' of the syntax.

In general terms, contexts Γ perform the role of tracking which variables are currently in scope. To see why contexts are needed, consider the expression 𝑥 + 𝑦; its resulting type depends on the types of the variables 𝑥 and 𝑦. If 𝑥 and 𝑦 are both natural numbers, 𝑥 + 𝑦 will be a natural number, but if 𝑥 and 𝑦 have different types (e.g., vectors, strings, complex numbers) then 𝑥 + 𝑦 will have a different type too. The correct interpretation of 𝑥 + 𝑦 depends on the context of the expression.

Next, define the judgements for our system in (2.3). Judgements are statements about the language.


Judgements for an example lambda calculus foundation. Γ, 𝑡 and α may be replaced with expressions drawn from the grammar in (2.2)

Γ ⊢ 𝑡 : α

Then define the natural deduction rules (2.4) for inductively deriving these judgements.


Judgement derivation rules for the example lambda calculus (2.2). Each rule gives a recipe for creating new judgements: given the judgements above the horizontal line, we can derive the judgement below the line (substituting the non-terminals for the appropriate ground terms). In this way one can inductively produce judgements.

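Γ ok    (𝑥 : α) ∉ Γ
────────────────────
[..Γ, (𝑥 : α)] ok

(𝑥 : α) ∈ Γ
────────────
Γ ⊢ 𝑥 : α

Γ ⊢ 𝑠 : α → β    Γ ⊢ 𝑡 : α
───────────────────────────
Γ ⊢ 𝑠 𝑡 : β

Γ, (𝑥 : α) ⊢ 𝑡 : β
──────────────────────────
Γ ⊢ (λ (𝑥 : α), 𝑡) : α → β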

And from this point, it is possible to start exploring the theoretical properties of the system. For example: is Γ ⊢ 𝑠 : α decidable?
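
As an illustrative aside, the typing rules can be read bottom-up as a recursive procedure. The following is a minimal Python sketch of that reading and is not taken from any prover: the names Base, Arrow, Var, App, Lam and infer are hypothetical, and the context is modelled as a tuple of (name, type) pairs in which the most recent binding wins, which sidesteps rather than checks the freshness side-condition on contexts. Every recursive call is on a strictly smaller term, so the procedure terminates, which suggests that the typing judgement is decidable for this small, fully annotated calculus.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


# Types: a base type (a member of the set A) or an arrow type.
@dataclass(frozen=True)
class Base:
    name: str


@dataclass(frozen=True)
class Arrow:
    dom: "Ty"
    cod: "Ty"


Ty = Union[Base, Arrow]


# Terms: variables, applications and annotated lambda abstractions.
@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class App:
    fn: "Term"
    arg: "Term"


@dataclass(frozen=True)
class Lam:
    var: str
    var_type: Ty
    body: "Term"


Term = Union[Var, App, Lam]

# Assumption: a context is a tuple of (name, type) pairs, newest last.
Context = Tuple[Tuple[str, Ty], ...]


def infer(ctx: Context, t: Term) -> Optional[Ty]:
    """Return the type of t in ctx if a derivation exists, else None."""
    if isinstance(t, Var):
        # Variable rule: look the variable up in the context.
        for name, ty in reversed(ctx):
            if name == t.name:
                return ty
        return None
    if isinstance(t, App):
        # Application rule: the function must have an arrow type whose
        # domain matches the type of the argument.
        fn_ty = infer(ctx, t.fn)
        arg_ty = infer(ctx, t.arg)
        if isinstance(fn_ty, Arrow) and arg_ty is not None and fn_ty.dom == arg_ty:
            return fn_ty.cod
        return None
    if isinstance(t, Lam):
        # Abstraction rule: type the body in the extended context.
        body_ty = infer(ctx + ((t.var, t.var_type),), t.body)
        return None if body_ty is None else Arrow(t.var_type, body_ty)
    return None


nat = Base("nat")
identity = Lam("x", nat, Var("x"))
```

Here infer((), identity) yields the arrow type from nat to nat, while an unbound variable or an ill-formed application yields None.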

Foundations such as the example above are usually written down in papers as a BNF grammar and a spread of gammas, turnstiles and lines as illustrated in (2.2), (2.3) and (2.4). LISP pioneer Steele calls it Computer Science Metanotation [Ste17: Steele Jr., Guy L. It's Time for a New Old Language (2017). http://2017.clojure-conj.org/guy-steele/].

In implementations of proof assistants, the foundation typically doesn't separate quite as cleanly into the above pieces. The language is implemented with a number of optimisations such as de Bruijn indexing [deB72: de Bruijn, Nicolaas Govert. Lambda calculus notation with nameless dummies, a tool for automatic formula manipulation, with application to the Church-Rosser theorem (1972). Indagationes Mathematicae (Proceedings)] for the sake of efficiency. Judgements and rules are implicitly encoded in algorithms such as type checking, or appear in forms different from that in the corresponding paper. This is primarily for efficiency and extensibility.

In this thesis the formalisation language that I focus on is the calculus of inductive constructions (CIC). Inductive datastructures (Section 2.2.3) for the Calculus of Constructions [CH88] were first introduced by Pfenning et al. [PP89]. CIC is the type theory used by Lean 3 as implemented by de Moura et al. and formally documented by Carneiro [Car19: Carneiro, Mario. Lean's Type Theory (2019). Masters' thesis (Carnegie Mellon University)]. A good introduction to mathematical formalisation with dependent type theory is the first chapter of the HoTT Book [Pro13 ch. 1: The Univalent Foundations Program. Homotopy Type Theory: Univalent Foundations of Mathematics (2013). Institute for Advanced Study]. Other foundations are also available: Isabelle's foundation is two-tiered [Pau89: Paulson, Lawrence C. The foundation of a generic theorem prover (1989). Journal of Automated Reasoning]: there is a meta-level foundation upon which many foundations can be implemented. A lot of the work in this thesis is independent of foundation and so I will try to indicate how the contributions can be augmented to work in other foundations.

A typical architecture of a modern, full-fledged checker-style proof assistant is given in Figure 2.5.

Figure 2.5

Schematic overview of a typical modern kernel-based proof assistant.

prover architecture diagram

2.2. Preliminaries

This section contains a set of quick preliminary definitions for the concepts and notation that I will be using later. In this thesis I will be using a pseudo-language which should be familiar to functional programming enthusiasts. This pseudo-language is purely presentational and is used to represent algorithms and datastructures for working with theorem provers.

2.2.1. Some notation for talking about type theory and algorithms

The world is built of types and terms. New variables are introduced as "𝑥 : A"; 𝑥 is the variable and it has the type A. Lots of variables with the same type can be introduced as 𝑥 𝑦 𝑧 : A. Types A B C : Type start with an uppercase letter and are coloured turquoise. Type is a special 'type of types'. Meanwhile terms start with a lowercase letter and term variables are purple and italicised. A → B is the function type. → is right associative, which means that 𝑓 : A → B → C should be read as 𝑓 : A → (B → C). This is called a curried function; we may consider A and B to be the input arguments of 𝑓 and C to be its return type. Given 𝑎 : A we may apply 𝑓 to 𝑎 by writing 𝑓 𝑎 : B → C. Functions are introduced using maps-to notation (𝑎 : A) ↦ (𝑏 : B) ↦ 𝑓 𝑎 𝑏. Write the identity function 𝑥 ↦ 𝑥 as 𝟙 : X → X. Given 𝑓 : A → B, 𝑔 : B → C, write function composition as 𝑔 ∘ 𝑓 : A → C. Function application is left associative, so 𝑓 𝑎 𝑏 should be read as (𝑓(𝑎))(𝑏). The input types of functions may optionally be given argument names, such as: (𝑎 : A) → (𝑏 : B) → C. We also allow 'dependent types' where the return type C is allowed to depend on these arguments: (𝑎 : A) → 𝒞 𝑎 where 𝒞 : A → Type is a type-valued function.

2.2.2. Functors and monads

I will assume that the readers are already familiar with the motivation behind functors and monads in category theory and as used in e.g. Haskell but I will summarise them here for completeness. I refer the unfamiliar reader to the Haskell Typeclassopedia (https://wiki.haskell.org/Typeclassopedia).

Definition 2.6 (functor): A functor is a type-valued function F : Type → Type equipped with a function mapper F (𝑓 : A → B) : F A → F B. (Here, the word 'functor' is used to mean the special case of category-theoretical functors with the domain and codomain category being the category of Type.) I always assume that the functor is lawful, which here means it obeys the functor laws (2.7).


Laws for functors.

F (𝑓 ∘ 𝑔) = (F 𝑓) ∘ (F 𝑔)
F (𝑥 ↦ 𝑥) 𝑦 = 𝑦

Definition 2.8 (natural function): A natural function a : F ⇒ G between functors F G : Type → Type is a family of functions a[A] : F A → G A indexed by A : Type such that a[B] ∘ F f = G f ∘ a[A] for all f : A → B. Often the type argument to a will be suppressed. It is quick to verify that the functors and the natural functions between them form a category.

Definition 2.9 (monad): A monad M : Type → Type is a functor equipped with two natural functions pure : 𝟙 ⇒ M and join : M M ⇒ M obeying the monad laws (2.10). (For learning about programming with monads, see https://wiki.haskell.org/All_About_Monads.) Write 𝑚 >>= 𝑓 := join (M 𝑓 𝑚) for 𝑚 : M A and 𝑓 : A → M B. do notation is used in places (see https://wiki.haskell.org/Keywords#do).


Laws for monads.

join[X] ∘ (M join[X]) = join[X] ∘ (join[M X])
join[X] ∘ (M pure[X]) = 𝟙
join[X] ∘ (pure[M X]) = 𝟙
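
To make the definition concrete, here is a small Python sketch that takes M to be the list functor. The function names fmap, pure, join and bind are chosen for illustration only; bind is derived from join and the function mapper exactly as in the definition of >>=, and the monad laws can be spot-checked on sample values.

```python
def fmap(f, m):
    """Function mapper of the list functor: apply f to every element."""
    return [f(x) for x in m]


def pure(x):
    """Embed a value as a one-element list."""
    return [x]


def join(mm):
    """Flatten a list of lists by one level."""
    return [x for m in mm for x in m]


def bind(m, f):
    """m >>= f, defined as join applied to the mapped structure."""
    return join(fmap(f, m))


# Sample values used to spot-check the laws.
nested = [[[1], [2, 3]], [[4]]]
flat = [1, 2, 3]

# Associativity: flattening the inner layers first or the outer layers
# first gives the same list.
assoc_lhs = join(fmap(join, nested))
assoc_rhs = join(join(nested))

# Unit laws: wrapping with pure (inside or outside) and then joining
# returns the original list.
unit_inner = join(fmap(pure, flat))
unit_outer = join(pure(flat))
```

A check on a handful of values is of course not a proof of the laws; it only illustrates what each law says for lists.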

Definition 2.11 (applicative): An applicative functor [MP08 §2: McBride, Conor; Paterson, Ross. Applicative programming with effects (2008). J. Funct. Program.] M : Type → Type is equipped with pure : A → M A and seq : M (A → B) → M A → M B. Write 𝑓 <*> 𝑎 := seq 𝑓 𝑎 and 𝑎 *> 𝑏 := pure (_ ↦ 𝑦 ↦ 𝑦) <*> 𝑎 <*> 𝑏. (<*> is left associative: 𝑢 <*> 𝑣 <*> 𝑤 = (𝑢 <*> 𝑣) <*> 𝑤.) Applicative functors obey the laws given in (2.12).


Laws for applicative functors. I use the same laws as presented by McBride and Paterson [MP08] but other equivalent sets are available.

(pure 𝟙) <*> 𝑢 = 𝑢
(pure (∘)) <*> 𝑢 <*> 𝑣 <*> 𝑤 = 𝑢 <*> (𝑣 <*> 𝑤)
(pure 𝑓) <*> (pure 𝑥) = pure (𝑓 𝑥)
𝑢 <*> pure 𝑥 = pure (𝑓 ↦ 𝑓 𝑥) <*> 𝑢

2.2.3. Inductive datatypes

New inductive datatypes are defined with a GADT-like syntax (2.13).


Example inductive definition of List. nil : List X and cons : X → List X → List X are the constructors.

List (X : Type) ::=
| nil
| cons (x : X) (l : List X)

In cases where it is obvious which constructor is being used, the tag names are suppressed. Function definitions with pattern matching use the syntax given in (2.14).


Example of the definition of a function f using pattern matching. The inl and inr constructors are suppressed in the pattern. Provocative spacing is used instead to suggest which case is being matched on.

f : Bool + (X × Y) → ℕ
| true     ↦ 3
| false    ↦ 0
|   (𝑥, 𝑦) ↦ 2

One can express inductive datatypes D as fixpoints of functors D = Fix P where Fix P := P (Fix P). Depending on the underlying category, Fix P may not exist for all P. (Smyth and Plotkin are the first to place some conditions on when the fixpoint exists [SP82: Smyth, Michael B; Plotkin, Gordon D. The category-theoretic solution of recursive domain equations (1982). SIAM Journal on Computing]; see Adámek et al. for a survey [AMM18: Adámek, Jiří; Milius, Stefan; Moss, Lawrence S. Fixed points of functors (2018). Journal of Logical and Algebraic Methods in Programming].)

Definition 2.15 (base functor): When a D : Type is written as Fix P for some P (and there is no Q such that P = Q ∘ Q ∘ ... ∘ Q), P is called the base functor for D. This conceptualisation is useful because we can use the base functor to make related types without needing to explicitly write down the constructors for the modified versions. For example we can make the list lazy with Lazy P X := Fix ((X ↦ Unit → X) ∘ P).

2.3. Inductive gadgets

For the rest of this thesis, I will make use of a few motifs for discussing inductive datastructures, particularly in Section 2.4, Chapter 3, Appendix A and Appendix C. In this section I will lay some background material for working with inductive datatypes.

2.3.1. Traversable functors

Given a monad M, a common task is performing a monad-map with f : A → M B over a list of objects l : List A. This is done with the help of a function called mmap (2.16).


Definition of a 'monad map' over lists for an applicative functor M : Type → Type and A B : Type.

mmap (𝑓 : A → M B)
  : List A → M (List B)
| [] ↦ pure []
| (ℎ::𝑙) ↦ pure cons <*> 𝑓 ℎ <*> mmap 𝑓 𝑙

But we can generalise List to some functor T : Type → Type; when can we equip T with an analogous mmap? For example, consider the case of binary trees (2.17).


Inductive definition of binary trees and a definition of mmap to compare with (2.16).

Tree A ::=
| leaf : Tree A
| branch : Tree A → A → Tree A → Tree A

mmap (𝑓 : A → M B)
  : Tree A → M (Tree B)
| leaf ↦ pure leaf
| (branch 𝑙 𝑎 𝑟) ↦
  pure branch <*> mmap 𝑓 𝑙 <*> 𝑓 𝑎 <*> mmap 𝑓 𝑟
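
The tree version of mmap can be tried out with a concrete applicative functor. The sketch below is an illustration under stated assumptions rather than part of the thesis software: M is taken to be Option, modelled in Python as either Some(value) or None, and the names Some, Leaf, Branch, branch, seq and safe_recip are hypothetical. The curried branch constructor plays the role of pure branch, so the body of mmap has the same applicative shape as the pseudocode.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Some:
    value: Any


# Option A is modelled as Some(a) for success and None for failure.
def pure(x):
    return Some(x)


def seq(mf, ma):
    """seq for Option: apply a wrapped function to a wrapped argument."""
    if mf is None or ma is None:
        return None
    return Some(mf.value(ma.value))


@dataclass(frozen=True)
class Leaf:
    pass


@dataclass(frozen=True)
class Branch:
    left: Any
    value: Any
    right: Any


def branch(l):
    """Curried Branch constructor, so that it can be fed to seq."""
    return lambda a: lambda r: Branch(l, a, r)


def mmap(f, t):
    """Map an effectful f over every value stored in the tree t."""
    if isinstance(t, Leaf):
        return pure(Leaf())
    return seq(seq(seq(pure(branch), mmap(f, t.left)), f(t.value)),
               mmap(f, t.right))


def safe_recip(x):
    """An effectful function: reciprocal that fails on zero."""
    return None if x == 0 else Some(1 / x)


good = Branch(Branch(Leaf(), 2, Leaf()), 4, Leaf())
bad = Branch(Leaf(), 0, Leaf())
```

Mapping safe_recip over good produces a wrapped tree of reciprocals, whereas a single failing element in bad makes the whole traversal fail.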

Definition 2.18 (traversable): A functor T : Type → Type is traversable when for all applicative functors (Definition 2.11) M : Type → Type, there is a natural function d[M] : (T ∘ M) ⇒ (M ∘ T). That is, for each X : Type we have d[M][X] : T (M X) → M (T X). In addition to being natural, d must obey the traversal laws given in (2.19) [JR12 Definition 3.3: Jaskelioff, Mauro; Rypacek, Ondrej. An Investigation of the Laws of Traversals (2012). Proceedings Fourth Workshop on Mathematically Structured Functional Programming, MSFP@ETAPS 2012, Tallinn, Estonia].


Commutative diagrams for the traversal laws. The leftmost diagram must hold for any natural function a : F ⇒ G.

Given a traversable functor T and a monad M, we can recover mmap : (A → M B) → T A → M (T B) as mmap 𝑓 𝑡 := d[M][B] (T 𝑓 𝑡).

2.3.2. Functors with coordinates

Bird et al. [BGM+13: Bird, Richard; Gibbons, Jeremy; Mehner, Stefan; et al. Understanding idiomatic traversals backwards and forwards (2013). Proceedings of the 2013 ACM SIGPLAN symposium on Haskell] prove that (in the category of sets) the traversable functors are equivalent to a class of functors called finitary containers. Their theorem states that there is a type Shape T 𝑛 : Type for each traversable T and 𝑛 : ℕ such that each 𝑡 : T X is isomorphic to an object called a finitary container on Shape T shown in (2.20). (An explicit definition of Shape T 𝑛 is the pullback of children[1] : T Unit → List Unit and !𝑛 : Unit → List Unit, the list with 𝑛 elements.)


A finitary container is a count 𝑛, a shape 𝑠 : Shape T length and a vector children. Vec length X is the type of lists in X with length length.

(length : ℕ)
× (shape : Shape T length)
× (children : Vec length X)

map and traverse may be defined for the finitary container as map and traverse over the children vector. Since 𝑡 : T X has 𝑡.length child elements, the children of 𝑡 can be indexed by the numbers {𝑘 : ℕ | 𝑘 < length}. We can then define operations to get and set individual elements according to this index 𝑘.

Usually, however, this numerical indexing of the children of 𝑡 : T X loses the semantics of the datatype. As an example, consider the case of a binary tree Tree in (2.21). A tree 𝑡 : Tree X with 𝑛 branch components will have length 𝑛 and a corresponding children : Vec 𝑛 X, but indexing via numerical indices {𝑘 | 𝑘 < 𝑛} loses information about where the particular child 𝑥 : X can be found in the tree.


Definition of binary trees using a base functor. Compare with the definition (2.17).

TreeBase A X ::=
| leaf : TreeBase A X
| branch : X → A → X → TreeBase A X
Tree A := Fix (TreeBase A)

Now I will introduce a new way of indexing the members of children for the purpose of reasoning about inductive datatypes. This idea has been used and noted many times before, the main instance being paths in universal algebra [BN98 Dfn. 3.1.3: Baader, Franz; Nipkow, Tobias. Term rewriting and all that (1998). Cambridge University Press]. However, I have not seen an explicit account of this idea in the general setting of traversable functors, nor of its extension to general inductive datatypes (Section 2.3.3).

Definition 2.22 (coordinates): A traversable functor T has coordinates when equipped with a type C : Type and a function coords[𝑛] : Shape T 𝑛 → Vec 𝑛 C. The coords function amounts to a labelling of the 𝑛 children of a particular shape with members of C.

Often when using traversals, working with the children list Vec (length 𝑡) X for each shape of T can become unwieldy, so it is convenient to instead explicitly provide a pair of functions get and set (2.23) for manipulating particular children of a given 𝑡 : T X.


Getter and setter signatures and equations. Here 𝑙[𝑖] is the 𝑖th member of 𝑙 : List X and Vec.set 𝑖 𝑣 𝑥 replaces the 𝑖th member of the vector 𝑣 : Vec 𝑛 X with 𝑥 : X.

get : C → T X → Option X
set : C → T X → X → T X
get 𝑐 𝑡 = if ∃ 𝑖, (coords 𝑡)[𝑖] = 𝑐
  then some 𝑡.children[𝑖]
  else none
set 𝑐 𝑡 𝑥 = if ∃ 𝑖, (coords 𝑡)[𝑖] = 𝑐
  then Vec.set 𝑖 𝑡.children 𝑥
  else 𝑡

C is not unique, and in general should be chosen to have some semantic value for thinking about the structure of T. Here are some examples of functors with coordinates:


Defining the List D coordinates for binary trees. Here the left/right items in the C = List D can be interpreted as a sequence of "take the left/right branch" instructions. set is omitted for brevity but follows a similar pattern to get.

D ::= | left | right
coords : Tree X → List (List D)
| leaf ↦ []
| branch 𝑙 𝑥 𝑟 ↦
  [ ..[[left, ..𝑐] for 𝑐 in coords 𝑙]
  , []
  , ..[[right, ..𝑐] for 𝑐 in coords 𝑟]
  ]
get : List D → Tree X → Option X
| _ ↦ leaf ↦ none
| [] ↦ branch 𝑙 𝑥 𝑟 ↦ some 𝑥
| [left, ..𝑐] ↦ branch 𝑙 𝑥 𝑟 ↦ get 𝑐 𝑙
| [right, ..𝑐] ↦ branch 𝑙 𝑥 𝑟 ↦ get 𝑐 𝑟
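
The binary-tree coordinates can be exercised with a small Python sketch, including the setter that the listing omits. This is one possible rendering under stated assumptions: paths are tuples of the strings "left" and "right", Option is modelled by returning None for a missing child (so stored values are assumed never to be None), and the names Leaf, Branch, coords, get and set_at are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass(frozen=True)
class Leaf:
    pass


@dataclass(frozen=True)
class Branch:
    left: Any
    value: Any
    right: Any


# A coordinate is a sequence of "left"/"right" instructions.
Path = Tuple[str, ...]


def coords(t) -> List[Path]:
    """List the coordinate of every value in t, in left-to-right order."""
    if isinstance(t, Leaf):
        return []
    return ([("left",) + c for c in coords(t.left)]
            + [()]
            + [("right",) + c for c in coords(t.right)])


def get(path: Path, t):
    """Return the value at path, or None if the path leaves the tree."""
    if isinstance(t, Leaf):
        return None
    if not path:
        return t.value
    child = t.left if path[0] == "left" else t.right
    return get(path[1:], child)


def set_at(path: Path, t, x):
    """Return a copy of t with the value at path replaced by x.

    The tree is returned unchanged if the path does not address a value.
    """
    if isinstance(t, Leaf):
        return t
    if not path:
        return Branch(t.left, x, t.right)
    if path[0] == "left":
        return Branch(set_at(path[1:], t.left, x), t.value, t.right)
    return Branch(t.left, t.value, set_at(path[1:], t.right, x))


tree = Branch(Branch(Leaf(), "a", Leaf()), "b", Branch(Leaf(), "c", Leaf()))
```

Reading the values back through their coordinates recovers the in-order listing of the tree, and each coordinate says where in the tree the value lives, which a numerical index would not.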

2.3.3. Coordinates on initial algebras of traversable functors

Given a functor F with coordinates C, we can induce coordinates on the free monad Free F : Type → Type of F. The free monad is defined concretely in (2.25).


Definition of a free monad Free F X and join for a functor F : Type → Type and X : Type.

Free F X ::=
| pure : X → Free F X
| make : F (Free F X) → Free F X
join : (Free F (Free F X)) → Free F X
| pure 𝑥 ↦ 𝑥
| (make 𝑓) ↦ make (F join 𝑓)

We can write Free F X as the fixpoint of A ↦ X + F A. (As mentioned in Section 2.2.3, these fixpoints may not exist. However, for the purposes of this thesis the Fs of interest are always polynomial functors.) Free F has coordinates List C with methods defined in (2.26).


Definitions of the coordinate methods for Free F given F has coordinates C. Compare with the concrete binary tree definitions (2.24).

coords : Free F X → List (List C)
| pure 𝑥 ↦ [[]]
| make 𝑓 ↦
  [ [𝑐, ..𝑎]
    for 𝑎 in coords (get 𝑐 𝑓)
    for 𝑐 in coords 𝑓]
get : List C → Free F X → Option X
| [] ↦ pure 𝑥 ↦ some 𝑥
| [𝑐, ..𝑎] ↦ make 𝑓 ↦ (get 𝑐 𝑓) >>= get 𝑎
| _ ↦ _ ↦ none
set : List C → Free F X → X → Free F X
| [] ↦ pure _ ↦ 𝑥 ↦ pure 𝑥
| [𝑐, ..𝑎] ↦ make 𝑓 ↦ 𝑥 ↦ (set 𝑐 𝑓)
| _ ↦ _ ↦ none

In a similar manner, List C can be used to reference particular subtrees of an inductive datatype D which is the fixpoint of a traversable functor D = F D. Let F have coordinates C. D here is not a functor, but we can similarly define coords : D → List (List C), get : List C → D → Option D and set : List C → D → D → D.

The advantage of using coordinates over some other system such as optics [FGM+07: Foster, J Nathan; Greenwald, Michael B; Moore, Jonathan T; et al. Combinators for bidirectional tree transformations: A linguistic approach to the view-update problem (2007). ACM Transactions on Programming Languages and Systems (TOPLAS)] or other apparatus for working with datatypes [LP03: Lämmel, Ralf; Peyton Jones, Simon. Scrap Your Boilerplate (2003). Programming Languages and Systems, First Asian Symposium, APLAS 2003, Beijing, China, November 27-29, 2003, Proceedings] is that they are much simpler to reason about. A coordinate is just an address of a particular subtree. Another advantage is that the choice of C can convey some semantics on what the coordinate is referencing (for example, C = left | right in (2.24)), which can be lost in other ways of manipulating datastructures.

2.4. Metavariables

Now with a way of talking about logical foundations, we can resume from Section 2.1.2 and consider the problem of how to represent partially constructed terms and proofs given a foundation. This is the purpose of a development calculus: to take some logical system and produce some new system such that one can incrementally build terms and proofs in a way that provides feedback at intermediate points and ensures that various judgements hold for these intermediate terms. In Chapter 3, I will create a new development calculus for building human-like proofs, and in Appendix A this system will be connected to Lean. First we look at how Lean's current development calculus behaves. Since I will be using Lean 3 in this thesis and performing various operations over its expressions, I will follow the same general setup as is used in Lean 3. The design presented here was first developed by Spiwack [Spi11: Spiwack, Arnaud. Verified computing in homological algebra, a journey exploring the power and limits of dependent type theory (2011). PhD thesis (INRIA)] and first released in Coq 8.5. It was built to allow for a type-safe treatment of creating tactics with metavariables in a dependently-typed foundation.

2.4.1. Expressions and types

In this section I will introduce the expression foundation language that will be used for the remainder of the thesis. The system presented here is typical of expression structures found in DTT-based provers such as Lean 3 and Coq. I will not go into detail on induction schema and other advanced features because the work in this thesis is independent of them.

Definition 2.27 (expression): A Lean expression is a recursive datastructure Expr defined in (2.28).


Definition of a base functor for pure DTT expressions as used by Lean.

ExprBase X ::=
| lambda : Binder → X → ExprBase X -- function abstraction
| pi : Binder → X → ExprBase X -- dependent function type
| var : Name → ExprBase X -- variables
| const : Name → ExprBase X -- constants
| app : X → X → ExprBase X -- function application
| sort : Level → ExprBase X -- type universe
Binder := (name : Name) × (type : Expr)
Context := List Binder
Expr := Fix ExprBase

In (2.28), Level can be thought of as expressions over some signature that evaluate to natural numbers. They are used to stratify Lean's types so that one can avoid Girard's paradox [Hur95: Hurkens, Antonius J. C. A simplification of Girard's paradox (1995). International Conference on Typed Lambda Calculi and Applications]. Name is a type of easily distinguishable identifiers; in the case of Lean, Names are lists of strings or numbers. I sugar lambda 𝑥 α 𝑏 as λ (𝑥 ∶ α), 𝑏, pi 𝑥 α 𝑏 as Π (𝑥 ∶ α), 𝑏, app 𝑓 𝑎 as 𝑓 𝑎 and omit var and const when it is clear what the appropriate constructor is.

Using ExprBase, define pure expressions Expr := Fix ExprBase as in Section 2.2.3. Note that it is important to distinguish between the meta-level type system introduced in Section 2.2 and the object-level type system where the 'types' are merely instances of Expr. (This distinction can always be deduced from syntax, but to give a subtle indication of this distinction, object-level type assignment statements such as (𝑥 ∶ α) are annotated with a slightly smaller variant of the colon ∶ as opposed to : which is used for meta-level statements.) That is, 𝑡 : Expr is a meta-level statement indicating that 𝑡 is an expression, but 𝑡 ∶ α is an object-level judgement about expressions stating that 𝑡 has the type α, where α : Expr and α ∶ sort.

Definition 2.29 (variable binding): Variables may be bound by λ and Π expressions. For example, in λ (𝑥 ∶ α), 𝑡, we say that the expression binds 𝑥 in 𝑡. If 𝑡 contains variables that are not bound, these are called free variables. Now, given a partial map σ : Name ⇀ Expr and a term 𝑡 : Expr, we define a substitution subst σ 𝑡 : Expr as in (2.30). This will be written as σ 𝑡 for brevity.


Definition of substitution on an expression. Here, ExprBase (subst σ) 𝑒 is mapping each child expression of 𝑒 with subst σ; see Section 2.2.3.

subst σ : Expr → Expr
| var 𝑥 ↦ if 𝑥 ∈ dom σ then σ 𝑥 else 𝑥
| 𝑒 ↦ ExprBase (subst σ) 𝑒

I will denote substitutions as a list of Name ↦ Expr pairs. For example, 𝑥 ↦ 𝑡, 𝑦 ↦ 𝑠 where 𝑥 𝑦 : Name are the variables which will be replaced with the terms 𝑡 𝑠 : Expr respectively.

Substitution can easily lead to badly-formed expressions if there are variable naming clashes. I need only note here that we can always perform a renaming of variables in a given expression to avoid clashes upon substitution. These clashes are usually avoided within prover implementations with the help of de Bruijn indexing [deB72].
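
A naming clash of this kind is easy to reproduce. The following Python sketch implements a deliberately naive substitution on a cut-down expression type (the class names Var, App and Lam and the function subst are hypothetical, and types are represented by plain variables for brevity). It shows a free variable being captured by a binder, and how renaming the bound variable beforehand avoids the problem.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class App:
    fn: Any
    arg: Any


@dataclass(frozen=True)
class Lam:
    var: str        # name of the bound variable
    var_type: Any   # type annotation of the binder
    body: Any


def subst(sigma: Dict[str, Any], e):
    """Naive substitution: replace variables and recurse into children.

    No renaming is performed, so free variables of the substituted terms
    can be captured by binders in e.
    """
    if isinstance(e, Var):
        return sigma.get(e.name, e)
    if isinstance(e, App):
        return App(subst(sigma, e.fn), subst(sigma, e.arg))
    if isinstance(e, Lam):
        return Lam(e.var, subst(sigma, e.var_type), subst(sigma, e.body))
    return e


T = Var("T")
sigma = {"x": Var("y")}

# The binder is called y, so the substituted free y ends up bound.
clashing = Lam("y", T, App(Var("f"), Var("x")))
captured = subst(sigma, clashing)

# Renaming the bound variable to z first keeps the substituted y free.
renamed = Lam("z", T, App(Var("f"), Var("x")))
safe = subst(sigma, renamed)
```

In captured the variable y now refers to the binder, which changes the meaning of the expression; in safe it still refers to the outer y.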

2.4.2. Assignable datatypes

Given an expression structure Expr and 𝑡 : Expr, we can define a traversal over all of the immediate subexpressions of 𝑡.


Illustrative code for mapping the immediate subexpressions of an expression using child_traverse.

child_traverse (M : Monad) (𝑓 : Context → Expr → M Expr)
  : Context → Expr → M Expr
| Γ ↦ (Expr.var 𝑛) ↦ pure (Expr.var 𝑛)
| Γ ↦ (Expr.app 𝑙 𝑟) ↦
  pure (Expr.app) <*> 𝑓 Γ 𝑙 <*> 𝑓 Γ 𝑟
| Γ ↦ (Expr.lambda 𝑛 α 𝑏) ↦
  pure (Expr.lambda 𝑛) <*> 𝑓 Γ α <*> 𝑓 [..Γ, (𝑛 ∶ α)] 𝑏

The function child_traverse defined in (2.31) is different from a normal traversal of a datastructure because the mapping function 𝑓 is also passed a context Γ indicating the current variable context of the subexpression. Thus when exploring a λ-binder, 𝑓 can take into account the modified context. This means that we can define context-aware expression manipulating tools such as collecting the free variables in an expression (fv in (2.32)).


Some example implementations of expression manipulating tools with the child_traverse construct. The monad structure on Set is pure := 𝑥 ↦ {𝑥}, join (𝑠 : Set (Set X)) := ⋃ 𝑠 and map 𝑓 𝑠 := 𝑓[𝑠]. fv stands for 'free variables'.

instantiate : Name → Expr → Context → Expr → Expr
| 𝑥 ↦ 𝑟 ↦ Γ ↦ (Expr.var 𝑛) ↦ if (𝑥 = 𝑛) then 𝑟 else Expr.var 𝑛
| 𝑥 ↦ 𝑟 ↦ Γ ↦ 𝑡 ↦ child_traverse 𝟙 (instantiate 𝑥 𝑟) Γ 𝑡
fv : Context → Expr → Set Name
| Γ ↦ (Expr.var 𝑛) ↦ if 𝑛 ∈ Γ then ∅ else {𝑛}
| Γ ↦ 𝑡 ↦ child_traverse Set (fv) Γ 𝑡
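
The context-threading idea behind fv can be sketched in Python as follows. This is a simplification rather than a transcription: instead of a traversal that is generic in the monad, a hypothetical helper children returns each immediate subexpression paired with the context it lives in, and fv takes the union of the results, which corresponds to specialising the monad to Set. Contexts are reduced to tuples of names, and Var, App and Lam are hypothetical stand-ins for the expression constructors.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class App:
    fn: Any
    arg: Any


@dataclass(frozen=True)
class Lam:
    var: str
    var_type: Any
    body: Any


# Assumption: a context only records the names in scope, not their types.
Context = Tuple[str, ...]


def children(ctx: Context, e):
    """Immediate subexpressions of e, each paired with its own context."""
    if isinstance(e, App):
        return [(ctx, e.fn), (ctx, e.arg)]
    if isinstance(e, Lam):
        # The body lives in the context extended with the bound variable;
        # the type annotation does not.
        return [(ctx, e.var_type), (ctx + (e.var,), e.body)]
    return []


def fv(ctx: Context, e) -> frozenset:
    """Names of the variables of e that are not bound in ctx or in e."""
    if isinstance(e, Var):
        return frozenset() if e.name in ctx else frozenset({e.name})
    result = frozenset()
    for child_ctx, child in children(ctx, e):
        result |= fv(child_ctx, child)
    return result


expr = Lam("x", Var("A"), App(Var("f"), Var("x")))
```

For expr the bound variable x is not reported, while A and f are; supplying a context that already contains f removes f from the result as well.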

The idea here is to generalise child_traverse to include any datatype that may involve expressions. Frequently when building systems for proving, one has to make custom datastructures. For example, one might wish to create a 'rewrite-rule' structure (2.33) for modelling equational reasoning (as will be done in Chapter 4).


Simple RewriteRule representation defined as a pair of Exprs, representing lhs = rhs. This is to illustrate the concept of assignable datatypes.

RewriteRule := (lhs : Expr) × (rhs : Expr)

Definition 2.34 (telescope): Another example might be a telescope of binders Δ : List Binder. A list of binders is defined as a telescope in Γ : Context when each successive binder is defined in the context of the binders before it. That is, [] is a telescope and [(𝑥 ∶ α), ..Δ] is a telescope in Γ if Δ is a telescope in [..Γ, (𝑥 ∶ α)] and Γ ⊢ 𝑥 ∶ α.

But now if we want to perform a variable instantiation or count the number of free variables present in 𝑟 : RewriteRule, we have to write custom definitions to do this. The usual traversal functions from Section 2.3.1 are not adequate for telescopes, because we may need to take into account a binder structure. Traversing a telescope as a simple list of names and expressions will produce the wrong output for fv, because some of the variables are bound by previous binders in the context.

Definition 2.35 (assignable): To avoid having to write all of this boilerplate, let's make a typeclass assignable (2.36) on datatypes that we need to manipulate the expressions in. The expr_traverse method in (2.36) traverses over the child expressions of a datatype (e.g., the lhs and rhs of a RewriteRule or the type expressions in a telescope). expr_traverse also includes a Context object to enable traversal of child expressions which may be in a different context to the parent datatype.


Say that a type X is assignable by equipping X with the given expr_traverse operation. Implementations of expr_traverse for RewriteRule (2.33) and telescopes are given as examples.

class assignable (X : Type) :=
(expr_traverse :
  (M : Monad) →
  (Context → Expr → M Expr) →
  Context → X → M X)

expr_traverse M 𝑓
  : Context → RewriteRule → M RewriteRule
| Γ ↦ (𝑙, 𝑟) ↦ do
  𝑙' ← 𝑓 Γ 𝑙;
  𝑟' ← 𝑓 Γ 𝑟;
  pure ⟨𝑙', 𝑟'⟩

expr_traverse M 𝑓
  : Context → Telescope → M Telescope
| Γ ↦ [] ↦ pure []
| Γ ↦ [(𝑥 ∶ α), ..Δ] ↦ do
  α' ← 𝑓 Γ α;
  Δ' ← expr_traverse M 𝑓 [..Γ, (𝑥 ∶ α)] Δ;
  pure [(𝑥 ∶ α'), ..Δ']

Now, provided expr_traverse is defined for X, the operations fv, instantiate and other expression-manipulating operations such as those in (2.32) can be modified to use expr_traverse instead of child_traverse. This assignable regime becomes useful when using de Bruijn indices to represent bound variables [deB72] because the length of Γ can be used to determine the binder depth of the current expression. Examples of implementations of assignable and expression-manipulating operations that can make use of assignable can be found in my Lean implementation of this concept (https://github.com/leanprover-community/mathlib/pull/5719).
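
The benefit of the typeclass can be illustrated with a read-only Python analogue. It is a sketch under simplifying assumptions and not the Lean implementation linked above: instead of a monadic expr_traverse, each datatype exposes a hypothetical method expr_children that lists its child expressions together with the context each one lives in, and a single generic free_vars function then works for both rewrite rules and telescopes.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class App:
    fn: Any
    arg: Any


# Assumption: a context only records the names in scope.
Context = Tuple[str, ...]


def fv_expr(ctx: Context, e) -> set:
    """Free variables of a binder-free expression relative to ctx."""
    if isinstance(e, Var):
        return set() if e.name in ctx else {e.name}
    return fv_expr(ctx, e.fn) | fv_expr(ctx, e.arg)


@dataclass(frozen=True)
class RewriteRule:
    lhs: Any
    rhs: Any

    def expr_children(self, ctx: Context):
        # Both sides live in the same context as the rule itself.
        return [(ctx, self.lhs), (ctx, self.rhs)]


@dataclass(frozen=True)
class Telescope:
    binders: Tuple[Tuple[str, Any], ...]  # (name, type) pairs

    def expr_children(self, ctx: Context):
        # Each type lives in the context extended by the earlier binders.
        out = []
        for name, ty in self.binders:
            out.append((ctx, ty))
            ctx = ctx + (name,)
        return out


def free_vars(ctx: Context, x) -> set:
    """Free variables of any object that provides expr_children."""
    result = set()
    for child_ctx, e in x.expr_children(ctx):
        result |= fv_expr(child_ctx, e)
    return result


rule = RewriteRule(App(Var("f"), Var("x")), Var("y"))
tel = Telescope((("n", Var("Nat")), ("v", App(Var("Vec"), Var("n")))))
```

In tel the variable n in the second binder is bound by the first binder, so free_vars reports only Nat and Vec; treating the telescope as a flat list of expressions would wrongly report n as free.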

2.4.3. Lean's development calculus

In the Lean source code, there are constructors for Expr other than those in (2.28). Some are for convenience or efficiency reasons (such as Lean 3 macros), but others are part of the Lean development calculus. The main development calculus construction is mvar or a metavariable, sometimes also called an existential variable or schematic variable. An mvar ?m acts as a 'hole' for an expression to be placed in later. There is no kernel machinery to guarantee that an expression containing a metavariable is correct; instead, they are used for the process of building expressions.

As an example, suppose that we needed to prove P ∧ Q for some propositions P Q ∶ Prop. The metavariable-based approach to proving this would be to declare a new metavariable ?𝑡 ∶ P ∧ Q. Then, a prover constructs a proof term for P ∧ Q in two steps: declare two new metavariables ?𝑡₁ ∶ P and ?𝑡₂ ∶ Q; and then assign ?𝑡 with the expression and.make ?𝑡₁ ?𝑡₂ where and.make ∶ P → Q → P ∧ Q is the constructor for ∧. After this, ?𝑡₁ and ?𝑡₂ themselves are assigned with p ∶ P and q ∶ Q. In this way, the proof term can be built up slowly as ?𝑡 ⇝ and.make ?𝑡₁ ?𝑡₂ ⇝ and.make p ?𝑡₂ ⇝ and.make p q. This process is more convenient for building modular programs that construct proofs than requiring that a pure proof term be made all in one go, because a partially constructed proof is represented as a proof term where certain subexpressions are metavariables.

Lean comes with a development calculus that uses metavariables. This section can be viewed as a more detailed version of the account originally given by de Moura et al. [MAKR15 §3.2: de Moura, Leonardo; Avigad, Jeremy; Kong, Soonho; et al. Elaboration in Dependent Type Theory (2015). CoRR] with the additional details sourced from inspecting the Lean source code. Lean's metavariable management system makes use of a stateful global 'metavariable context' with carefully formed rules governing valid assignments of metavariables. While all automated provers make use of some form of metavariables, this specific approach to managing them for use with tactics was first introduced in Spiwack's thesis [Spi11], where the tactic monad for Coq was augmented with a stateful global metavariable context.

The implementation of Lean allows another Expr constructor for metavariables:


Redefining Expr with metavariables using the base functor given in (2.28).

Expr ::=
| ExprBase Expr
| ?Name

Metavariables are 'expression holes' and are denoted as ?𝑥 where 𝑥 : Name. They are placeholders into which we promise to substitute a valid pure expression later. Similarly to fv(𝑡) being the free variables in 𝑡 : Expr, we can define mv(𝑡) to be the set of metavariables present in 𝑡. However, we still need to be able to typecheck and reduce expressions involving metavariables and so we need to have some additional structure on the context.

The idea is that in addition to a local context Γ, expressions are inspected and created within the scope of a second context called the metavariable context 𝑀 : MvarContext. The metavariable context is a dictionary MvarContext := Name ⇀ MvarDecl where each metavariable declaration 𝑑 : MvarDecl holds information about the metavariable, including its type (the type field) and, once it has been assigned, the expression assigned to it (the assignment field).

The metavariable context can be used to typecheck an expression containing metavariables by assigning each occurrence ?𝑥 with the type given by the corresponding declaration 𝑀[𝑥].type in 𝑀. The assignment field of MvarDecl is used to perform instantiation. We can interpret 𝑀 as a substitution.
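
Reading the metavariable context as a substitution can be illustrated with a toy Python model that replays the conjunction example. Everything here is a simplification made for illustration: types are plain strings that are never checked, MvarDecl has only the two fields type and assignment, and the names Const, App, MVar, instantiate and has_mvars are hypothetical rather than Lean's.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Const:
    name: str


@dataclass(frozen=True)
class App:
    fn: Any
    arg: Any


@dataclass(frozen=True)
class MVar:
    name: str


@dataclass
class MvarDecl:
    type: Any                          # unchecked, for display only
    assignment: Optional[Any] = None   # None while the hole is open


def instantiate(mctx: Dict[str, MvarDecl], e):
    """Replace every assigned metavariable in e by its assignment."""
    if isinstance(e, MVar):
        decl = mctx[e.name]
        if decl.assignment is None:
            return e
        return instantiate(mctx, decl.assignment)
    if isinstance(e, App):
        return App(instantiate(mctx, e.fn), instantiate(mctx, e.arg))
    return e


def has_mvars(e) -> bool:
    """True if e still contains an unfilled hole."""
    if isinstance(e, MVar):
        return True
    if isinstance(e, App):
        return has_mvars(e.fn) or has_mvars(e.arg)
    return False


# Declare the goal and two subgoals, then assign them step by step.
mctx = {"t": MvarDecl(type="P and Q")}
mctx["t1"] = MvarDecl(type="P")
mctx["t2"] = MvarDecl(type="Q")
mctx["t"].assignment = App(App(Const("and.make"), MVar("t1")), MVar("t2"))
mctx["t1"].assignment = Const("p")
partial = instantiate(mctx, MVar("t"))   # one hole remains
mctx["t2"].assignment = Const("q")
final = instantiate(mctx, MVar("t"))     # a pure expression
```

The value partial still contains the hole for the second conjunct, while final contains no metavariables and so could be handed to a checker.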

As mentioned in Section 2.1.2, the purpose of the development calculus is to represent a partially constructed proof or term. The kernel does not need to check expressions in the development calculus (which here means expressions containing metavariables), so there is no need to ensure that an expression using metavariables is sound in the sense that declaring and assigning metavariables will be compatible with some set of inference rules such as those given in (2.4). However, in Appendix A.1, I will provide some inference rules for typing expressions containing metavariables to assist in showing that the system introduced in Chapter 3 is compatible with Lean.

2.4.4. Tactics

A partially constructed proof or term in Lean is represented as a TacticState object. For our purposes, this can be considered as holding the following data:

TacticState :=
(result : Expr)
× (mctx : MvarContext)
× (goals : List Expr)
Tactic (A : Type) := TacticState → Option (TacticState × A)

The result field is the final expression that will be returned when the tactic completes. goals is a list of metavariables that are used to denote what the tactic state is currently 'focussing on'. Both goals and result are in the context of mctx.

Tactics may perform actions such as modifying the goals or performing assignments of metavariables. In this way, a user may interactively build a proof object by issuing a stream of tactics.
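To make the shape of Tactic concrete, here is a minimal Python sketch in which a tactic is a function from a state to either None (failure) or a pair of a new state and a return value. Expressions are simplified to strings, and the helper names assign_main_goal and and_then are hypothetical; they are not part of Lean's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class TacticState:
    result: str                     # stand-in for the result expression
    mctx: Dict[str, Optional[str]]  # metavariable name -> assignment, if any
    goals: Tuple[str, ...]          # names of the metavariables in focus


# A tactic either fails (None) or returns a new state and a value.
Tactic = Callable[[TacticState], Optional[Tuple[TacticState, object]]]


def assign_main_goal(value: str) -> Tactic:
    """Assign the first goal to `value` and drop it from the goal list."""
    def run(s: TacticState):
        if not s.goals:
            return None  # no goal to work on, so the tactic fails
        goal, rest = s.goals[0], s.goals[1:]
        return TacticState(s.result, {**s.mctx, goal: value}, rest), None
    return run


def and_then(first: Tactic, second: Tactic) -> Tactic:
    """Run `first`, then run `second` on the resulting state."""
    def run(s: TacticState):
        out = first(s)
        if out is None:
            return None
        return second(out[0])
    return run


start = TacticState("pair ?g ?h", {"?g": None, "?h": None}, ("?g", "?h"))
script = and_then(assign_main_goal("proof_of_g"), assign_main_goal("proof_of_h"))
final_state, _ = script(start)
print(final_state.goals)
print(final_state.mctx)
```

Issuing a stream of tactics then amounts to chaining such functions until the goal list is empty; a third call to assign_main_goal on the final state would fail because no goals remain.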

2.5. Understandability and confidence

This section is a short survey of literature on what it means for a mathematical proof to be understandable. This is used in Chapter 6 to evaluate my software and to motivate the design of the software in Chapter 3 and Chapter 4.

2.5.1. Understandability of mathematics in a broader context

What does it mean for a proof to be understandable? An early answer to this question comes from the 17th-century philosopher Spinoza. Spinoza [Spi87: Spinoza, Benedict. The chief works of Benedict de Spinoza (1887). Chiswick Press] supposes 'four levels' of a student's understanding of a given mathematical principle or rule, which are:

  1. mechanical: The student has learnt a recipe to solve the problem, but no more than that.

  2. inductive: The student has verified the correctness of the rule in a few concrete cases.

  3. rational: The student comprehends a proof of the rule and so can see why it is true generally.

  4. intuitive: The student is so familiar with and immersed in the rule that they cannot comprehend it not being true.

For the purposes of this thesis I will restrict my attention to level 3 understanding. That is, how the student digests a proof of a general result. If the student is at level 4, and treats the result like a fish treats water, then there seems to be little an ITP system can offer other than perhaps forcing any surprising counterexamples to arise when the student attempts to formalise it.

Edwina Michener's Understanding Understanding Mathematics [Mic78: Michener, Edwina Rissland. Understanding understanding mathematics (1978). Cognitive science] provides a wide ontology of methods for understanding mathematics. Michener (p. 373) proposes that "understanding is a complementary process to problem solving" and incorporates Spinoza's 4-level model. She also references Poincaré's thoughts on understanding [Poi14: Poincaré, Henri. Science and method (1914). Amazon (out of copyright); p. 118], from which I will take an extended quote:

What is understanding? Has the word the same meaning for everybody? Does understanding the demonstration of a theorem consist in examining each of the syllogisms of which it is composed and being convinced that it is correct and conforms to the rules of the game? ...

Yes, for some it is; when they have arrived at the conviction, they will say, I understand. But not for the majority... They want to know not only whether the syllogisms are correct, but why they are linked together in one order rather than in another. As long as they appear to them engendered by caprice, and not by an intelligence constantly conscious of the end to be attained, they do not think they have understood.

In a similar spirit, de Millo, Lipton and Perlis [MUP79: de Millo, Richard A; Lipton, Richard J; Perlis, Alan J. Social processes and proofs of theorems and programs (1979). Communications of the ACM] write, referring directly to the nascent field of program verification (here referred to as 'proofs of software'):

Mathematical proofs increase our confidence in the truth of mathematical statements only after they have been subjected to the social mechanisms of the mathematical community. These same mechanisms doom the so-called proofs of software, the long formal verifications that correspond, not to the working mathematical proof, but to the imaginary logical structure that the mathematician conjures up to describe his feeling of belief. Verifications are not messages; a person who ran out into the hall to communicate his latest verification would rapidly find himself a social pariah. Verifications cannot really be read; a reader can flay himself through one of the shorter ones by dint of heroic effort, but that's not reading. Being unreadable and - literally - unspeakable, verifications cannot be internalized, transformed, generalized, used, connected to other disciplines, and eventually incorporated into a community consciousness. They cannot acquire credibility gradually, as a mathematical theorem does; one either believes them blindly, as a pure act of faith, or not at all.

Poincaré's concern is that a verified proof is not sufficient for understanding. De Millo et al question whether a verified proof is a proof at all! Even if a result has been technically proven, mathematicians care about the structure and ideas behind the proof itself. If this were not the case, then it would be difficult to explain why new proofs of known results are valued by mathematicians. I explore the question of what exactly they value in Chapter 6.

Many studies investigate mathematical understanding within an educational context; see the work of Sierpinska [Sie90: Sierpinska, Anna. Some remarks on understanding in mathematics (1990). For the learning of mathematics] [Sie94: Sierpinska, Anna. Understanding in mathematics (1994). Psychology Press] for a summary. See also Pólya's manual on the same topic [Pól62: Pólya, George. Mathematical Discovery (1962). John Wiley & Sons].

2.5.2. Confidence

Another line of inquiry suggested by Poincaré's quote is distinguishing confidence in a proof from a proof being understandable. By confidence in a proof, I do not mean confidence in the result being true, but instead confidence in the given script actually being a valid proof of the result.

Figure 2.39

A cartoon illustrating a component of the proof of the Jordan curve theorem for polygons as described by Hales [Hal07]. Call the edge of the purple polygon 𝐶; then the claim that this cartoon illustrates is that, given any disk 𝐷 (in red) and any point 𝑝 not on 𝐶, we can 'walk along a simple polygonal arc' (here in green) from 𝑝 to the disk 𝐷.

As an illustrative example, I will give my own impressions on some proofs of the Jordan curve theorem, which states that any non-intersecting continuous loop in the 2D Euclidean plane has an interior region and an exterior region. Formal and informal proofs of this theorem are discussed by Hales [Hal07: Hales, Thomas C. The Jordan curve theorem, formally and informally (2007). The American Mathematical Monthly]. I am confident that the proof of the Jordan curve theorem formalised by Hales in the HOL Light proof assistant is correct although I can't claim to understand it in full. Contrast this with the diagrammatic proof sketch (Figure 2.39) given in Hales' paper (originating with Thomassen [Tho92: Thomassen, Carsten. The Jordan-Schönflies theorem and the classification of surfaces (1992). The American Mathematical Monthly]). This sketch is more understandable to me but I am less confident in it being a correct proof (e.g., maybe there is some curious fractal curve that causes the diagrammatic proofs to stop being obvious...). In the special case of the curve being a polygon, the proof involves "walking along a simple polygonal arc (close to but not intersecting [the polygon])" and Hales notes:

Nobody doubts the correctness of this argument. Every mathematician knows how to walk without running in to walls. Detailed figures indicating how to "walk along a simple polygonal arc" would be superfluous, if not downright insulting. Yet, it is quite another matter altogether to train a computer to run around a maze-like polygon without collisions...

These observations demonstrate how one's confidence in a mathematical result is not merely a formal affair, but includes ostensibly informal arguments of correctness. This corroborates the attitude taken by De Millo et al in Section 2.5.1. Additionally, as noted in Section 1.1, confidence in results also includes a social component: a mathematician will be more confident that a result is correct if that result is well established within the field.

There has also been some empirical work on the question of confidence in proofs. Inglis and Alcock [QED: Inglis, Matthew; Alcock, Lara. Expert and novice approaches to reading mathematical proofs (2012). Journal for Research in Mathematics Education] performed an empirical study comparing the eye movements of undergraduates and postgraduates. A set of undergraduates and postgraduate researchers were presented with a set of natural language proofs and then asked to judge the validity of these proofs. The main outcomes they suggest from their work are that mathematicians can disagree about the validity of even short proofs and that postgraduates read proofs in a different way to undergraduates, moving their focus back and forth more. This suggests that we might expect undergraduates and postgraduates to present different reasons for their confidence in the proofs.

2.5.3. Understandability and confidence within automated theorem proving

The concepts of understandability and confidence have also been studied empirically within the context of proof assistants. This will be picked up in Chapter 6.

Stenning et al. [SCO95: Stenning, Keith; Cox, Richard; Oberlander, Jon. Contrasting the cognitive effects of graphical and sentential logic teaching: reasoning, representation and individual differences (1995). Language and Cognitive Processes] used the graphical Hyperproof software (also discussed in Section 5.1) to compare graphical and sentence-based representations in the teaching of logic. They found that both representations had similar transferability (that is, do lessons learnt in one domain transfer to analogous problems in other domains? The psychological literature identifies this as a difficult problem in teaching) and that the best teaching representation (in terms of test scores) was largely dependent on the individual differences between the students. This suggests that in looking for what it means for a proof to be understandable, we should not forget that people have different ways of thinking about proofs, and so there is not going to be a one-size-fits-all solution. It also suggests that providing multiple ways of conceptualising problems should help with understandability.

In Grebing's thesis [Gre19: Grebing, Sarah Caecilia. User Interaction in Deductive Interactive Program Verification (2019). PhD thesis (Karlsruhe Institute of Technology)], a set of focus group studies is conducted in which users with a variety of experience levels in Isabelle and KeY are asked to reflect on the user interfaces. One of her main findings was that, due to the extensive levels of automation in the proving process, a 'gap' can arise between the user's model of the proof state and the proof state created through the automation. Grebing then provides a bridge for this gap in the form of a proof scripting language and user interface for the KeY prover at a higher level of abstraction than the existing interface. Grebing also provides a review of other empirical studies conducted on the user interfaces of proof assistants [Gre19 §6.2.0].

2.6. Human-like reasoning

How should a prover work to produce human-like mathematical reasoning? The easiest answer is: however humans think it should reason!

The very earliest provers such as the Boyer-Moore theorem prover [BM73: Boyer, Robert S.; Moore, J. Strother. Proving Theorems about LISP Functions (1973). IJCAI] [BM90: Boyer, Robert S; Moore, J Strother. A theorem prover for a computational logic (1990). International Conference on Automated Deduction] [BKM95: Boyer, Robert S; Kaufmann, Matt; Moore, J Strother. The Boyer-Moore theorem prover and its interactive enhancement (1995). Computers & Mathematics with Applications] take this approach to some extent; the design is steered through a process of introspection on how the authors would prove theorems. Nevertheless, with their 'waterfall' architecture, the main purpose is to prove theorems automatically, rather than creating proofs that a human could follow. Indeed Robinson's machine-like resolution method [BG01: Bachmair, Leo; Ganzinger, Harald. Resolution theorem proving (2001). Handbook of automated reasoning] was such a dominant approach that Bledsoe titled his paper 'Non-resolution theorem proving' [Ble81: Bledsoe, Woodrow W. Non-resolution theorem proving (1981). Readings in Artificial Intelligence]. In this paper, Bledsoe sought to show another side of automated theorem proving through a review of alternative methods to resolution. A quote from this paper stands out for our current study:

It was in trying to prove a rather simple theorem in set theory by paramodulation and resolution, where the program was experiencing a great deal of difficulty, that we became convinced that we were on the wrong track. The addition of a few semantically oriented rewrite rules and subgoaling procedures made the proof of this theorem, as well as similar theorems in elementary set theory, very easy for the computer. Put simply: the computer was not doing what the human would do in proving this theorem. When we instructed it to proceed in a "human-like" way, it easily succeeded. Other researchers were having similar experiences.

This quote captures the concept of 'human-like' that I want to explore. Some piece of automation is 'human-like' when it doesn't get stuck in a way that a human would not.

Another early work on human-oriented reasoning is that of Nevins [Nev74: Nevins, Arthur J. A human oriented logic for automatic theorem-proving (1974). Journal of the ACM]. Similar to this thesis, Nevins is motivated by the desire to make proofs more understandable to mathematicians. Some examples of prover automation that are designed to perform steps that a human would take are grind for PVS [SORS01: Shankar, Natarajan; Owre, Sam; Rushby, John M; et al. PVS prover guide (2001). Computer Science Laboratory, SRI International, Menlo Park, CA] and the waterfall algorithm in ACL2 [KMM13: Kaufmann, Matt; Manolios, Panagiotis; Moore, J Strother. Computer-aided reasoning: ACL2 case studies (2013). Springer].

All of the systems mentioned so far came very early in the history of computing, and had a minuscule proportion of the computing power available to us today. Today, the worry that a piece of automation may not find a solution in a human-like way, or may find a circuitous route to a proof, is less pressing because computers are much more powerful. However, I think that the resource constraints that these early pioneers faced provide some clarity on why building human-like reasoning systems matters. The designers of these early systems were forced to introspect carefully on how they themselves were able to prove certain theorems without needing to perform a large amount of computation, and then incorporated these human-inspired insights into their designs.

My own journey into this field started with reading the work of Gowers and Ganesalingam (G&G) in their Robot prover [GG17: Ganesalingam, Mohan; Gowers, W. T. A fully automatic theorem prover with human-style output (2017). Journal of Automated Reasoning] (a working fork of this can be found at https://github.com/edayers/robotone). G&G's motivation was to find a formal system that better represented the way that a human mathematician would solve a mathematics problem, demonstrating this through the ability to generate realistic natural-language write-ups of these proofs. The system made use of a natural-deduction style hierarchical proof-state with structural sharing. The inference rules (which they refer to as 'moves') on these states and the order in which they were invoked were carefully chosen through an introspective process. The advantage of this approach is that the resulting proofs could be used to produce convincing natural language write-ups of the proofs. However, the system was not formalised and was limited to the domains hard-coded into the system. The work in this thesis is a reimagining of this system within a formalised ITP system.

A different approach to exploring human-like reasoning is by modelling the process of mathematical discourse. Pease, Corneli, Martin, et al. [CMM+17: Corneli, Joseph; Martin, Ursula; Murray-Rust, Dave; et al. Modelling the way mathematics is actually done (2017). Proceedings of the 5th ACM SIGPLAN International Workshop on Functional Art, Music, Modeling, and Design] [PLB+17: Pease, Alison; Lawrence, John; Budzynska, Katarzyna; et al. Lakatos-style collaborative mathematics through dialectical, structured and abstract argumentation (2017). Artificial Intelligence] have investigated the use of graphical discourse models of mathematical reasoning. In this thesis, however, I have restricted the scope to human-like methods for solving simple lemmas that can produce machine-checkable proofs.

Figure 2.40

A visual representation of summing the first 𝑛 integers with counters. The lower black triangle's rows comprise 1, 2, 3, …, 𝑛 counters, from which a human can quickly see that 1 + 2 + ⋯ + 𝑛 = 𝑛(𝑛 + 1)/2.

Another key way in which humans reason is through the use of diagrams [Jam01: Jamnik, Mateja. Mathematical Reasoning with Diagrams: From Intuition to Automation (2001). CSLI Press] and alternative representations of mathematical proofs. A prima facie unintuitive result such as 1 + 2 + ⋯ + 𝑛 = 𝑛(𝑛 + 1)/2 snaps together when presented with the appropriate representation in Figure 2.40. Jamnik's previous work explores how one can perform automated reasoning like this in the domain of diagrams. Some recent work investigating and automating this process is the rep2rep project [RSS+20]. This is an important feature of general human-like reasoning; however, in the name of scope management I will not explore representations further in this thesis.

2.6.1. Levels of abstraction

There have been many previous works which add higher-level abstraction layers atop an existing prover with the aim of making a prover that is more human-like.

Archer et al. developed the TAME system for the PVS prover [AH97: Archer, Myla; Heitmeyer, Constance. Human-style theorem proving using PVS (1997). International Conference on Theorem Proving in Higher Order Logics]. Although they were focussed on proving facts about software rather than mathematics, the goals are similar: they wish to create software that produces proofs which are natural to humans. TAME makes use of a higher abstraction level. However, it is only applied to reasoning about timed automata and doesn't include a user study.

As part of the auto2 prover tactic for Isabelle, Zhan [Zha16: Zhan, Bohua. AUTO2, a saturation-based heuristic prover for higher-order logic (2016). International Conference on Interactive Theorem Proving] developed a high-level proof script syntax to guide the automation of auto2. A script takes the form of asserting several intermediate facts for the prover to prove before proving the main goal. This script is used to steer the auto2 prover towards proving the result. This contrasts with tactic-based proof and structural scripts (e.g. Isar [Wen99]) which are instead instructions for chaining together tactics. With the auto2 style script, it is possible to omit a lot of the detail that would be required by tactic-based scripts, since steps and intermediate goals that are easy for the automation to solve can be omitted entirely. A positive of this approach is that by being implemented within the Isabelle theorem prover, the results of auto2 are checked by a kernel. However, it is not a design goal of auto2 to produce proofs that a human can read.

2.6.2. Proof planning

Proof planning originated with Bundy [Bun88: Bundy, Alan. The use of explicit plans to guide inductive proofs (1988). International conference on automated deduction] [Bun98: Bundy, Alan. Proof Planning (1998). University of Edinburgh, Department of Artificial Intelligence] and is the practice of carrying out a proof with respect to a high-level plan (e.g., I am going to perform induction then simplify terms) that is generated before low-level operations commence (performing induction, running simplification algorithms). The approach follows the general field of AI planning.

AI planning in its most general conception [KKY95: Kambhampati, Subbarao; Knoblock, Craig A; Yang, Qiang. Planning as refinement search: A unified framework for evaluating design tradeoffs in partial-order planning (1995). Artificial Intelligence] is the process of searching a graph G using plan-space rather than by searching it directly. In a typical planning system, each point in plan-space is a directed acyclic graph (DAG) of objects called ground operators or methods, each of which has a mapping to paths in G. Each ground operator is equipped with predicates on the vertices of G called pre/post-conditions. Various AI planning methods such as GRAPHPLAN [BF97] can be employed to discover a partial ordering of these methods, which can then be used to construct a path in G. This procedure applied to the problem of finding proofs is proof planning. The main issue with proof planning [Bun02] is that it is difficult to identify sets of conditions and methods that do not cause the plan space to be too large or disconnected. However, in this thesis we are not trying to construct plans for entire proofs, but just to model the thought processes of humans when solving simple equalities. A comparison of the various proof planners is provided by Dennis, Jamnik and Pollet [DJP06].
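The role of pre- and post-conditions can be illustrated with a small Python sketch. It is deliberately not GRAPHPLAN or a partial-order planner: it is a plain breadth-first search over sets of facts, and the method names and facts (induction, base_case_open, and so on) are hypothetical labels invented for this example.

```python
from collections import deque
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Method:
    name: str
    pre: FrozenSet[str]                   # facts that must hold before applying
    add: FrozenSet[str]                   # facts made true (post-conditions)
    delete: FrozenSet[str] = frozenset()  # facts made false


def plan(start: Iterable[str], goal: Iterable[str],
         methods: List[Method]) -> Optional[List[str]]:
    """Breadth-first search for a sequence of methods that reaches the goal."""
    first = frozenset(start)
    target = frozenset(goal)
    queue = deque([(first, [])])
    seen = {first}
    while queue:
        state, path = queue.popleft()
        if target <= state:
            return path
        for m in methods:
            if m.pre <= state:  # pre-conditions are satisfied
                nxt = (state - m.delete) | m.add
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [m.name]))
    return None  # no plan exists with these methods


methods = [
    Method("induction",
           pre=frozenset({"goal_is_forall_nat"}),
           add=frozenset({"base_case_open", "step_case_open"}),
           delete=frozenset({"goal_is_forall_nat"})),
    Method("simplify_base",
           pre=frozenset({"base_case_open"}),
           add=frozenset({"base_case_closed"}),
           delete=frozenset({"base_case_open"})),
    Method("rewrite_step",
           pre=frozenset({"step_case_open"}),
           add=frozenset({"step_case_closed"}),
           delete=frozenset({"step_case_open"})),
]

print(plan({"goal_is_forall_nat"},
           {"base_case_closed", "step_case_closed"},
           methods))
```

Even in this toy setting, the difficulty noted above is visible: the plan found depends entirely on how well the conditions carve up the space, and a poorly chosen set of facts either blows up the search or leaves the goal unreachable.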

Proof planning in the domain of finding equalities frequently involves a technique called rippling [BSV+93: Bundy, Alan; Stevens, Andrew; Van Harmelen, Frank; et al. Rippling: A heuristic for guiding inductive proofs (1993). Artificial Intelligence] [BBHI05: Bundy, Alan; Basin, David; Hutter, Dieter; et al. Rippling: meta-level guidance for mathematical reasoning (2005). Cambridge University Press], in which an expression is annotated with additional structure determined by the differences between the two sides of the equation that directs the rewriting process. The rippling algorithm captures some human intuitions about which parts of a rewriting expression are salient. In the system for equational rewriting I introduce in Chapter 4, I avoid using rippling because the techniques are tied to performing induction.

Another technique associated with proof planning is the concept of proof critics [Ire92: Ireland, Andrew. The use of planning critics in mechanizing inductive proofs (1992). International Conference on Logic for Programming Artificial Intelligence and Reasoning]. Proof critics are programs which take advantage of the information from a failed proof plan to construct a new, amended proof plan. An interactive version of proof critics has also been developed [IJR99]. In the work in Chapter 3, this concept of revising a proof based on a failure is used.

Another general AI system that will be relevant to this thesis is hierarchical task networks [MS99: Melis, Erica; Siekmann, Jörg. Knowledge-based proof planning (1999). Artificial Intelligence] [Tat77: Tate, Austin. Generating project networks (1977). Proceedings of the 5th International Joint Conference on Artificial Intelligence], which are used to drive the behaviour of artificial agents such as the ICARUS architecture [LCT08]. In a hierarchical task network, tasks are recursively refined into subtasks, which are then used to find fine-grained methods for achieving the original tasks, eventually bottoming out in atomic actions such as actuating a motor. HTNs naturally lend themselves to human-like reasoning, and I will make use of these in designing a hierarchical algorithm for performing equational reasoning.
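The recursive refinement of tasks into atomic actions can be sketched in a few lines of Python. This is a simplification under stated assumptions: each compound task has exactly one decomposition and no pre-conditions, whereas a real HTN planner chooses between several candidate methods, and the task names are hypothetical.

```python
from typing import Dict, List

# Hypothetical decomposition table: compound task -> ordered subtasks.
# Any task that does not appear as a key is treated as an atomic action.
DECOMPOSITIONS: Dict[str, List[str]] = {
    "prove_equality": ["make_sides_similar", "close_by_reflexivity"],
    "make_sides_similar": ["expand_definition", "cancel_common_terms"],
}


def refine(task: str, table: Dict[str, List[str]]) -> List[str]:
    """Recursively refine a task until only atomic actions remain."""
    if task not in table:
        return [task]  # atomic action: nothing left to refine
    actions: List[str] = []
    for subtask in table[task]:
        actions.extend(refine(subtask, table))
    return actions


print(refine("prove_equality", DECOMPOSITIONS))
```

The output is the flat sequence of atomic actions, while the table itself retains the hierarchy that explains why each action is being taken.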

2.7. Natural language for formal mathematics

In this section I will survey the background and related work on using natural language to generate proofs. The material in this section will be used in Section 3.6 and Chapter 6.

2.7.1. Natural language generation in a wider context

Data-to-text natural language generation (NLG) is a subfield of natural language processing (NLP) that focusses on the problem of computing intelligible natural language discourses and text from some non-textual object (without a human in the loop!). An example is producing an English description of the local weather forecast from meteorological data. NLG techniques can range from simple 'canned text' and 'mail-merge' applications right up to systems with aspirations of generality such as modern voice recognition in smartphones.

There is a wide variety of architectures available for modern NLG [GK18: Gatt, Albert; Krahmer, Emiel. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation (2018). Journal of Artificial Intelligence Research]; however, they usually carry a modular structure, with a backbone [RD00] being split into three pipeline stages as shown in Figure 2.41.

Figure 2.41

Outline of a common architecture for general NLG systems.

[RD00: Reiter, Ehud; Dale, Robert. Building natural language generation systems (2000). Cambridge University Press]

These choices of stages are mainly motivated through a desire to reuse code and to separate concerns (a realiser does not need to know the subject of the text whose punctuation it is correcting). I make use of this architecture in Section 3.6.
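A toy Python sketch of such a three-stage pipeline is given below. The stage names follow the document planning, microplanning and surface realisation split usually attributed to Reiter and Dale; since the figure itself is not reproduced in the text, that naming is an assumption, as are the Fact representation and the lead-in words chosen for each kind of step.

```python
from typing import List, Tuple

# A fact is (kind, statement), where kind is "assume", "derive" or "conclude".
Fact = Tuple[str, str]


def document_plan(facts: List[Fact]) -> List[Fact]:
    """Stage 1: decide what to say and in which order."""
    rank = {"assume": 0, "derive": 1, "conclude": 2}
    return sorted(facts, key=lambda fact: rank[fact[0]])  # sorted() is stable


def microplan(ordered: List[Fact]) -> List[str]:
    """Stage 2: choose words for each step."""
    lead = {"assume": "suppose", "derive": "then", "conclude": "hence"}
    return [f"{lead[kind]} {statement}" for kind, statement in ordered]


def realise(clauses: List[str]) -> str:
    """Stage 3: fix capitalisation and punctuation without knowing the subject."""
    return " ".join(clause[0].upper() + clause[1:] + "." for clause in clauses)


facts = [
    ("conclude", "x + y is even"),
    ("assume", "x and y are odd"),
    ("derive", "x + y = 2(a + b + 1) for some integers a and b"),
]
print(realise(microplan(document_plan(facts))))
```

The separation of concerns is visible in realise, which only ever sees strings: it could be reused unchanged for text about any subject.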

An alternative approach to the one outlined above is to use statistical methods for natural language generation. With the advent of scalable machine learning (ML) and neural networks (NNs) in the 2010s, statistical methods have gained dominance in many NLG tasks such as translation and scene description. The system developed for this work in Section 3.6 is purely classical, with no machine learning component. In the context of producing simple write-ups of proofs, there will likely be some gains from including ML, but it is not clear that a statistical approach to NLG is going to assist in building understandable descriptions of proofs, because it is difficult to formally confirm that the resulting text generated by a black-box NLG component is going to accurately reflect the input.

2.7.2. Natural language generation for mathematics

The first modern study of the linguistics of natural language mathematics is the work of Ranta [Ran94: Ranta, Aarne. Syntactic categories in the language of mathematics (1994). International Workshop on Types for Proofs and Programs] [Ran95: Ranta, Aarne. Context-relative syntactic categories and the formalization of mathematical text (1995). International Workshop on Types for Proofs and Programs] concerning the translation between dependent type theory and natural language, and I will use some of his insights in Section 3.6. Ganesalingam's thesis [Gan10: Ganesalingam, Mohan. The language of mathematics (2010). PhD thesis (University of Cambridge)] is an excellent reference for understanding the linguistics of mathematics in general; however, it is more concerned with natural language input.

There have been numerous previous attempts at creating natural language output from a theorem prover: Felty-Miller [FM87: Felty, Amy; Miller, Dale. Proof explanation and revision (1987). Technical Report], Holland-Minkley et al within the NuPrl prover [HBC99: Holland-Minkley, Amanda M; Barzilay, Regina; Constable, Robert L. Verbalization of High-Level Formal Proofs (1999). AAAI/IAAI], and also in Theorema [BCJ+06: Buchberger, Bruno; Crǎciun, Adrian; Jebelean, Tudor; et al. Theorema: Towards computer-aided mathematical theory exploration (2006). Journal of Applied Logic]. A particularly advanced NLG for provers was Proverb [HF97: Huang, Xiaorong; Fiedler, Armin. Proof Verbalization as an Application of NLG (1997). International Joint Conference on Artificial Intelligence] for the Ωmega theorem prover [BCF+97: Benzmüller, Christoph; Cheikhrouhou, Lassaad; Fehrer, Detlef; et al. Ωmega: Towards a mathematical assistant (1997). Automated Deduction - CADE-14]. This system's architecture uses the pipeline in Figure 2.41; it takes as input a proof term generated by the Ωmega toolchain and outputs a natural language sentence. An issue with these generation tools is that they will often produce text that does not appear natural at the macro-level. That is, the general structure of the argument will be different to what would be found in a mathematical textbook. G&G illustrate some examples of this in their paper [GG17 §2].

The process of synthesising natural language is difficult in the general case. But as G&G [GG17] note, the language found in mathematical proofs is much more restricted than a general English text. At its most basic, a natural language proof is little more than a string of facts from the assumptions to the conclusion. There is no need for time-sensitive tenses or other complexities that arise in general text. Proofs are written this way because mathematical proofs are written to be checked by a human and so a uniformity of prose is used that minimises the chance of 'bugs' creeping in. This, combined with a development calculus designed to encourage human-like proof steps, makes the problem of creating mathematical natural language write-ups much more tenable. I will refer to these non-machine-learning approaches as 'classical' NLG.

A related problem worth mentioning here is the reverse process of NLG: parsing formal proofs and theorem statements from a natural language text. The two problems are interlinked in that they are both operating on the same grammar and semantics, but parsing raises a distinct set of problems to NLG, particularly around ambiguity [Gan10 ch. 2]. Within mathematical parsing there are two approaches. The first approach is controlled natural language [Kuh14: Kuhn, Tobias. A survey and classification of controlled natural languages (2014). Computational linguistics] as practised by ForTheL [Pas07: Paskevich, Andrei. The syntax and semantics of the ForTheL language (2007). PhD thesis (Université Paris XII)] and Naproche/SAD [CFK+09: Cramer, Marcos; Fisseni, Bernhard; Koepke, Peter; et al. The Naproche Project: Controlled Natural Language Proof Checking of Mathematical Texts (2009). Controlled Natural Language, Workshop on Controlled Natural Language]. Here, a grammar is specified to parse text that is designed to look as close to a natural language version of the text as possible. The other approach (which I will not make use of in this thesis) is to use machine learning techniques, for example the work on parsing natural mathematical texts by Stathopoulos et al [ST16: Stathopoulos, Yiannos A; Teufel, Simone. Mathematical information retrieval based on type embeddings and query expansion (2016). COLING 2016] [SBRT18: Stathopoulos, Yiannos; Baker, Simon; Rei, Marek; et al. Variable Typing: Assigning Meaning to Variables in Mathematical Text (2018). NAACL-HLT 2018].

In Section 3.6 I will make use of some ideas from natural language parsing, particularly the concept called notion by ForTheL and non-extensional type by Ganesalingam. A non-extensional type is a noun-phrase such as "element of a topological space" or "number" which is assigned to expressions; these types are not used by the underlying logical foundation but are used to parse mathematical text. To see why this is needed, consider the syntax x y. This is parsed to an expression differently depending on the types of x and y (e.g., if x is a function vs. an element of a group). Non-extensional types allow this parse to be disambiguated even if the underlying foundational language does not have a concept of a type.
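The x y example can be made concrete with a short Python sketch. The function name, the string-valued notions, and the output syntax (apply, group_mul) are all hypothetical; the only point is that the same surface syntax yields different expressions once each symbol carries a noun-phrase type.

```python
from typing import Dict


def parse_juxtaposition(left: str, right: str, notions: Dict[str, str]) -> str:
    """Parse the text 'left right' using the noun-phrase type of each symbol.

    `notions` maps a symbol to a non-extensional type such as 'function'.
    """
    if notions.get(left) == "function":
        return f"apply({left}, {right})"
    if (notions.get(left) == "element of a group"
            and notions.get(right) == "element of a group"):
        return f"group_mul({left}, {right})"
    raise ValueError(f"cannot disambiguate '{left} {right}'")


# The same surface syntax, read in two different ways.
print(parse_juxtaposition("f", "a", {"f": "function", "a": "number"}))
print(parse_juxtaposition("g", "h", {"g": "element of a group",
                                     "h": "element of a group"}))
```

Note that the notions dictionary is consulted only by the parser; nothing here requires the target expression language to have types of its own.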

2.8. Chapter summary

In this chapter I have provided the necessary background information and prior work needed to frame the rest of the thesis. I have explained the general design of proof assistants (Section 2.1). I have described a meta-level pseudolanguage for constructing algorithms (Section 2.2) and provided some gadgets for working with inductive types within it (Section 2.3). I have also presented the philosophy and social aspects of understandability in mathematics (Section 2.5); human-like automated reasoning (Section 2.6); and natural language generation of mathematical text (Section 2.7).

Chapter 3
A development calculus

Now that we have reviewed the requisite background material, I can define the moving parts of a human-like theorem prover. The driving principle is to find ways of representing proofs at the same level of detail that a human mathematician would use to communicate to colleagues.

The contributions of this chapter are:

HumanProof integrates with an existing proof assistant (in this case Lean). By plugging in to an existing prover, it is possible to gain leverage by utilising the already developed infrastructure for that prover such as parsers, tactics and automation. Using an existing prover also means that the verification of proofs can be outsourced to the prover's kernel.

The first research question of Section 1.2 was to investigate what it means for a proof to be human-like. I provided a review to answer this question in Section 2.6. Humans think differently to each other, and I do not wish to state that there is a 'right' way to perform mathematics. However, I argue that there are certain ways in which the current methods for performing ITP should be closer to the general cluster of ways in which humans talk about and solve problems.

In this chapter I investigate some ways in which the inference rules that provers use could be made more human-like, and then introduce a new proving abstraction layer, HumanProof, written in the Lean 3 theorem prover, implementing these ideas. Later, in Chapter 6, I gather thoughts and ratings from real mathematicians about the extent to which the developed system achieves these goals.

In Section 3.1, I first present an example proof produced by a human to highlight the key features of 'human-like' reasoning that I wish to emulate. Then in Section 3.2 I give an overview of the resulting designs and underline the primary design decisions and the evidence that drives them. In Section 3.3 I provide the details and theory of how the system works through defining the key Box structure and tactics on Boxes. The theory behind creating valid proof terms from Boxes is presented in Section 3.4 as well as how to run standard tactics within Boxes (Section 3.4.4). This theoretical basis will then be used to define the human-like tactics in Section 3.5. Then, I will detail the natural language generation pipeline for HumanProof in Section 3.6.

3.1. Motivation

Building on the background where I explored the literature on the definition of 'human-like' (Section 2.6) and 'understandable' (Section 2.5.1) proofs, my goal in this section is to find some specific improvements to the way in which computer-aided mathematics is done. I use these improvements to motivate the design choices of the HumanProof system.

3.1.1. The need for human-like systems

In Section 1.1, I noted that non-specialist mathematicians have yet to widely accept proof assistants despite the adoption of other tools such as computer algebra systems. Section 1.1 presented three problems that mathematicians have with theorem provers: differing attitudes on correctness, a high cost of learning to use ITP, and a low resulting reward (learning the truth of something that they 'knew' was true anyway). One way to improve this situation is to reduce the cost of learning to use proof assistants by making the way in which they process proofs more similar to how a human would process proofs, so that the proofs more closely match what the mathematician already knows. Making a prover which mimics a human's thought process also helps overcome the problem of differing attitudes on correctness.

Requiring a human-like approach to reasoning means that many automated reasoning methods such as SMT-solvers and resolution (see Section 2.6) must be ruled out. In these machine-oriented methods, the original statement of the proposition to be proved is first reduced to a normal form and mechanically manipulated with a small set of inference rules. The resulting proof is scarcely recognisable to a mathematician as a proof of the proposition, even if it is accepted by the kernel of a proof assistant. As discussed in Section 1.1, Section 2.5 and as will be confirmed in Chapter 6, mathematicians do not care just about a certificate that a statement is correct but also about the way in which the statement is correct.

Given some new way of creating proofs, how can we determine whether these created proofs are more 'human-like' than those of some other system? The way I propose here is to require that the program be able to imitate the reasoning of humans at least well enough to produce convincing natural language write-ups of the proofs that it generates, and then to test how convincing these write-ups are by asking mathematicians. This approach is shared by the previous work of Gowers and Ganesalingam [GG17] (abbreviated G&G), where they use a similar framework to the HumanProof system presented in this thesis to produce natural language write-ups of proofs for some lemmas in the domain of metric space topology. The work presented in this thesis builds significantly on the work of G&G.

3.1.2. Modelling human-like reasoning

One of the key insights of Gowers and Ganesalingam is that humans reason with a different 'basis' of methods than the logical operations and tactics that are provided to the user of an ITP. For example, a hypothesis such as a function 𝑓 : 𝑋 → 𝑌 being continuous expands to a formula (3.1) with interlaced quantifiers.

(3.1)
∀ (𝑥 : 𝑋) (ε > 0), ∃ (δ > 0), ∀ (𝑦 : 𝑋), 𝑑(𝑥, 𝑦) < δ → 𝑑(𝑓 𝑥, 𝑓 𝑦) < ε

Definition of a continuous function for metric spaces 𝑋, 𝑌. Here 𝑑 is the distance metric for 𝑋 or 𝑌.

However in a mathematical text, if one needs to prove 𝑑(𝑓 𝑥, 𝑓 𝑦) < ε, the hypothesis that 𝑓 is continuous will be applied in one go. That is, a step involving (3.1) would be written as "Since 𝑓 is continuous, there exists a δ > 0 such that 𝑑(𝑓 𝑥, 𝑓 𝑦) < ε whenever 𝑑(𝑥, 𝑦) < δ". Whereas in an ITP this process will need to be separated into several steps: first show ε > 0, then obtain δ, then show 𝑑(𝑥, 𝑦) < δ.
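To make the step-by-step style concrete, here is a small sketch in Lean 3 syntax. It is an illustration under stated assumptions: distances take values in an arbitrary type `R` carrying only `<` and `0` (standing in for the real numbers), `dX` and `dY` are hypothetical distance functions, and the continuity hypothesis `hf` is written out in full rather than taken from a library.

```lean
-- Lean 3 syntax. R, dX, dY and hf are assumptions of this sketch, not library names.
example {X Y R : Type} [has_lt R] [has_zero R]
  (dX : X → X → R) (dY : Y → Y → R) (f : X → Y)
  (hf : ∀ (x : X) (ε : R), 0 < ε →
          ∃ δ : R, 0 < δ ∧ ∀ y : X, dX x y < δ → dY (f x) (f y) < ε)
  (a b : X) (ε : R) (hε : 0 < ε) :
  ∃ δ : R, 0 < δ ∧ (dX a b < δ → dY (f a) (f b) < ε) :=
begin
  have h₁ := hf a ε hε,   -- supply the point, then ε together with a proof of 0 < ε
  cases h₁ with δ h₂,     -- obtain δ
  cases h₂ with hδ h₃,    -- separate 0 < δ from the remaining quantifier
  exact ⟨δ, hδ, h₃ b⟩,    -- specialise to b
end
```

Each tactic line corresponds to a fragment of what a textbook would state as the single sentence "since f is continuous, there exists a δ such that ...".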

The opposite problem arises with automated tactics such as the tableau prover blast [Pau99]. The issue with such tactics is that their process is opaque and leaves little explanation for why they succeed or fail. They may also step over multiple stages that a human would rather see spelled out in full. The most common occurrence of this is in definition expansion; two terms may be identical modulo definition expansion but a proof found in a textbook will often take the time to point out when such an expansion takes place.

This points towards creating a new set of inference rules for constructing proofs, each of which corresponds more closely to a particular reasoning step as might be used by a human mathematician.

3.1.3. Structural sharing

Structural sharing is defined as making use of the same substructure multiple times in a larger structure. For example, a tree with two identical branches uses structural sharing if both branches refer to the same object in memory. Structural sharing of this form is used frequently in immutable datastructures for efficiency. However, here I am interested in whether structural sharing has any applications in human-like reasoning.

When humans reason about mathematical proofs, they often flip between forwards reasoning and backwards reasoning. (Broadly speaking, forwards reasoning is any mode of modifying the goal state that acts only on the hypotheses of the proof state, whereas backwards reasoning modifies the goals.) The goal-centric proof state used by ITPs can make this kind of reasoning difficult. In the simplest example, suppose that the goal is P ∧ Q ⊢ Q ∧ P (that is, given the hypothesis P ∧ Q, prove Q ∧ P, where P and Q are propositions and ∧ is the logical-and operation). One solution is to perform a split on the goal to produce P ∧ Q ⊢ Q and P ∧ Q ⊢ P. However, a conjunction elimination on the P ∧ Q hypothesis will then need to be performed on both of the new goals. This is avoided if the elimination is performed before splitting the goal. In this simplified example it is clear in which order the forwards and backwards reasoning should be performed. But in more complex proofs, it may be difficult to see ahead how to proceed. A series of backwards reasoning steps may provide a clue as to how forwards reasoning should be applied. The usual way that this problem is solved is for the human to edit an earlier part of the proof script with the forwards reasoning step on discovering this. I reject this solution because it means that the resulting proof script no longer represents the reasoning process of the creator. The fact that the forwards reasoning step was motivated by the goal state at a later point is lost.
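The duplication can be seen directly in a flat, goal-centric tactic script. The two proofs below are an illustrative sketch in Lean 3 syntax using only core tactics; they are not taken from HumanProof.

```lean
-- Lean 3 syntax. Splitting first: the elimination of h is repeated in each goal.
example (P Q : Prop) (h : P ∧ Q) : Q ∧ P :=
begin
  split,
  { cases h with hp hq, exact hq },
  { cases h with hp hq, exact hp },
end

-- Eliminating first: both goals share hp and hq.
example (P Q : Prop) (h : P ∧ Q) : Q ∧ P :=
begin
  cases h with hp hq,
  split,
  { exact hq },
  { exact hp },
end
```

Arriving at the second script after having written the first requires editing an earlier line, which is exactly the loss of reasoning history discussed above.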

The need to share structure among objects in the name of efficiency has been studied at least as far back as Boyer and Moore [BM72]. However, the motivation behind introducing it here is purely for the purpose of creating human-like proofs.

The solution that I propose here is to use a different representation of the goal state that allows for structural sharing. This alteration puts the proof state calculus more in the camp of OLEG [McB00] and the G&G prover. The details of the implementation of structural sharing are presented later in Section 3.5.4.

Structural sharing can also be used to implement backtracking and counterfactuals. For example, suppose that we need to prove A ⊢ P ∨ Q. One could apply the ∨-left-introduction rule P → P ∨ Q, but then one might need to backtrack later in the event that really the right-introduction rule Q → P ∨ Q should be used instead. Structural sharing lets us split a goal into two counterfactuals.

3.1.4. Verification

One of the key benefits of proof assistants is that they can rigorously check whether a proof is correct. This distinguishes the HumanProof project from the prior work of G&G, where no formal proof checking was present. While I have argued in Section 2.5 (and the results of my user study in Section 6.6 will later suggest) that this guarantee of correctness is less important for attracting working mathematicians, there need not be a conflict between having a prover which is easy for non-specialists to understand and one which is formally verified.

3.1.5. What about proof planning?

Proof planning is the process of creating proofs using abstract proof methods that are assembled with the use of classical AI planning algorithms (an introduction to classical AI planning can be found in Russell and Norvig [RN10 Pt. III]). The concept of proof planning was first introduced by Bundy [Bun88]. A review of proof planning is given in Section 2.6.2. The advantage of proof planning is that it represents the way in which a problem will be solved at a much more abstract level, more like human mathematicians.

The primary issue with proof planning is that there is a steep learning curve. In order to get started with proof plans, one must learn a great deal of terminology and a new way of thinking about formalised mathematics. The user has to familiarise themselves with the way in which proof methods are used to construct proof plans and how to diagnose malformed plans for their particular problems. Bundy presents his own critique of proof planning [Bun02] which goes into more detail on this point.

The study of proof planning has fallen out of favour so far in the 21st century, possibly in relation to the rise of practical automated provers such as the first-order prover E [SCV19] and the SMT solver Z3 [MB08], and their incorporation into ITP through the use of 'hammer' software like Isabelle's Sledgehammer [BN10]. I share a great deal of the ideals that directed proof planning, and the equational reasoning system presented in Chapter 4 is inspired by it. However, I take a more practical stance: the additional abstractions that are placed atop the underlying tactic system should be transparent, in that they are understandable without needing to be familiar with proof planning and come with easy 'escape hatches' back to the tactic world if needed. This design goal is similar to that of the XBarnacle prover interface [LD97] (discussed later in Section 5.1), where a GUI is used to present an explorable representation of a proof plan.

3.2. Overview of the software

The software implementation of the work presented in this thesis is called 'HumanProof' and is implemented using the Lean 3 prover. The source code can be found at https://github.com/edayers/lean-humanproof-thesis. In this section I give a high-level overview of the system and some example screenshots. A general overview of the system and how it relates to the underlying Lean theorem prover is shown in Figure 3.2.

Figure 3.2

High-level overview of the main modules that comprise the HumanProof system and how these interface with Lean, ProofWidgets and the VSCode text editor. The green parts of the diagram are contributions given in this thesis. ProofWidgets (Chapter 5) was spun out from HumanProof for use as a general-purpose GUI system so that it could be used in other community projects (see Figure 5.18).

Given a theorem to prove, HumanProof is invoked by indicating a special begin [hp] script block in the proof document (see Figure 3.3). This initialises HumanProof's Box datastructure with the assumptions and goal proposition of the proof. The initial state of the prover is shown in the goal view of the development environment, called the Info View (the right panel of Figure 3.3). Using the ProofWidgets framework (developed in Chapter 5), this display of the state is interactive: the user can click at various points in the document to determine their next steps. Users can then manipulate this datastructure either through the use of interactive buttons or by typing commands into the proof script in the editor. In the event of clicking the buttons, the commands are immediately added to the proof script source file as if the user had typed them themselves (the left panel of Figure 3.3). In this way, the user can create proofs interactively whilst still preserving the plaintext proof document as the single source of truth; this ensures that there is no hidden state in the interactive view that is needed for Lean to reconstruct a proof of the statement. While the proof is being created, the system also produces a natural language write-up (labelled 'natural language writeup' in Figure 3.2) of the proof (Section 3.6) that is displayed alongside the proof state. As the proof progresses, users can see the incomplete natural language proof get longer too.

The system also comes equipped with a module for solving equalities using the 'subtasks algorithm' (Chapter 4), labelled 'subtasks' in Figure 3.2. The subtasks algorithm uses a hierarchical planning (see Section 2.6.2) system to produce an equality proof that is intended to match the way that a human would create the proof, as opposed to a more machine-like approach such as E-matching [BN98 Ch. 10]. The output of this subsystem is a chain of equations that is inserted into the natural language writeup.

Figure 3.3

Screenshot of HumanProof in action on a test lemma. To the left is the code editor. The user invokes HumanProof with the begin [hp] command. The blue apply H button can be clicked to automatically insert more proofscript.

3.3. The Box datastructure

At the heart of HumanProof is a development calculus using a datastructure called Box. The considerations from Section 3.1.3 led to the development of an 'on-tree' development calculus. Rather than storing a flat list of goals and a metavariable context alongside the result, the entire development state is stored in a recursive tree structure which I call a Box. The box tree, to be defined in Section 3.3.2, stores the proof state as an incomplete proof tree with on-tree metavariable declarations which is then presented to the user as a nested set of boxes.

3.3.1. An example of Box in action.

Before defining boxes in Section 3.3.2, let's look at a simple example. Boxes are visualised as a tree of natural-deduction-style goal states. Let's start with a minimal example to get a feel for the general progression of a proof with the Box architecture. Let's prove P ∨ Q → Q ∨ P using Boxes. The initial box takes the form (3.4).

(3.4)
────────────────────
?𝑡 : P ∨ Q → Q ∨ P

And we can read (3.4) as saying "we need to show P ∨ Q → Q ∨ P". The ?𝑡 is the name of the metavariable that the proof of this will be assigned to. The first action is to perform an intro step to get (3.5).

(3.5)
𝑕 : P ∨ Q
────────────
?𝑡 : Q ∨ P

To be read as "Given P ∨ Q, we need to show Q ∨ P". So far the structure is the same as would be observed on a flat goal list structure. The idea is that everything above a horizontal line is a hypothesis (something that we have) and everything below is a goal (something we want). When all of the goals are solved, we should have a valid proof of the original goal. At this point, we would typically perform an elimination step on 𝑕 (e.g., cases in Lean) to get (3.6).

(3.6)
┌ 𝑕₁ : P
│ ────────────
└ ?𝑡₁ : Q ∨ P
┌ 𝑕₂ : Q
│ ────────────
└ ?𝑡₂ : Q ∨ P

Here in (3.6) we can see nested boxes; each nested box below the horizontal line must be solved to solve the parent box. However, in the box architecture there is an additional step available: a branching on the goal (3.7).

(3.7)
𝑕 : P ∨ Q
────────────
?𝑡₁ : Q
    ∨
?𝑡₂ : P

If a pair of boxes appear with a ∨ between them, then either of the boxes can be solved to solve the parent box. And then we can eliminate 𝑕 on the branched box to get (3.8).

(3.8)
┌ 𝑕₁ : P
│ ────────────
│ ?𝑡₁₁ : Q
│     ∨
└ ?𝑡₁₂ : P
┌ 𝑕₂ : Q
│ ────────────
│ ?𝑡₂₁ : Q
│     ∨
└ ?𝑡₂₂ : P

Now at this point, we can directly match 𝑕₁ with ?𝑡₁₂ and 𝑕₂ with ?𝑡₂₁ to solve the box. Behind the scenes, the box is also producing a result proof term that can be checked by the proof assistant's kernel.
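For comparison, the same statement can be proved with a flat goal list in Lean 3 core tactics. This sketch is given only for contrast with the Box walkthrough and is not output from HumanProof: here the `left` and `right` tactics commit to one disjunct per goal, whereas the branching step above keeps both alternatives visible until one of them is matched.

```lean
-- Lean 3 syntax, core tactics only.
example (P Q : Prop) : P ∨ Q → Q ∨ P :=
begin
  intro h,
  cases h with h₁ h₂,
  { right, exact h₁ },  -- h₁ : P, so commit to the right disjunct of Q ∨ P
  { left, exact h₂ },   -- h₂ : Q, so commit to the left disjunct of Q ∨ P
end
```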

3.3.2. Definition of Box

The above formulation is intended to match with the architecture designed in G&G, so that all of the same proof-steps developed in G&G are available. Unlike G&G, the system also interfaces with a flat goal-based development calculus, and so it is possible to use both G&G proof-steps and Lean tactics within the same development calculus. To do this, let's formalise the system presented above in Section 3.3.1 with the following Box datatype (3.9). Define a Binder := (name : Name) × (type : Expr) to be a name identifier and a type with notation (name∶type), using a smaller colon to keep the distinction from a meta-level type annotation.


(3.9)
Inductive definition of Box.

Box ::=
| ℐ (x : Binder) (b : Box) : Box
| 𝒢 (m : Binder) (b : Box) : Box
| 𝒭 (r : Expr) : Box
| 𝒜 (b₁ : Box) (r : Binder) (b₂ : Box) : Box
| 𝒪 (b₁ : Box) (b₂ : Box) : Box
| 𝒱 (x : Binder) (t : Expr) (b : Box) : Box
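For readers who prefer to see the datatype as code, the following Python sketch mirrors the six constructors with frozen dataclasses. Several things here are assumptions made for illustration: expressions are plain strings rather than Lean `Expr` values, the class names `I`, `G`, `R`, `A`, `O`, `V` are ASCII stand-ins for the script letters (with `I` for the hypothesis binder), the one-line role comments are my reading of the walkthrough in Section 3.3.1, and the `or.inl`/`or.inr` result strings in the example are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Union

Expr = str  # stand-in for Lean expressions (an assumption of this sketch)


@dataclass(frozen=True)
class Binder:
    name: str
    type: Expr


@dataclass(frozen=True)
class I:  # hypothesis binder: x is available inside b
    x: Binder
    b: Box


@dataclass(frozen=True)
class G:  # goal binder: metavariable m remains to be solved inside b
    m: Binder
    b: Box


@dataclass(frozen=True)
class R:  # result expression
    r: Expr


@dataclass(frozen=True)
class A:  # both boxes must be solved; r names the result of b1 inside b2
    b1: Box
    r: Binder
    b2: Box


@dataclass(frozen=True)
class O:  # either box may be solved
    b1: Box
    b2: Box


@dataclass(frozen=True)
class V:  # value binder: x := t is available inside b
    x: Binder
    t: Expr
    b: Box


Box = Union[I, G, R, A, O, V]

# The branching state (3.7): given h : P ∨ Q, solve either ?t1 : Q or ?t2 : P.
box = I(Binder("h", "P ∨ Q"),
        O(G(Binder("t1", "Q"), R("or.inl t1")),
          G(Binder("t2", "P"), R("or.inr t2"))))
```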

I will represent instances of the Box type with a 2D box notation defined in (3.10) to make the connotations of the datastructure more apparent.


(3.10)
Visualisation rules for the Box type. Each visualisation rule takes a pair 𝐿 ⟼ 𝑅 where 𝐿 is a constructor for Box and 𝑅 is the visualisation. Everything above the horizontal line in the box is called a hypothesis. Everything below a line within a box is a 𝒢-box, called a goal. This visualisation is also implemented in Lean using the widgets framework presented in Section 5.8.

ℐ (𝑥∶α) 𝑏 ⟼
  𝑥 : α
  𝑏

𝒢 (𝑥∶α) 𝑏 ⟼
  ────────
  ?𝑥 : α
  𝑏

𝒭 𝑟 ⟼
  𝑟

𝒜 𝑏₁ (𝑥∶α) 𝑏₂ ⟼
  [𝑥 :=] 𝑏₁
  𝑏₂

𝒪 𝑏₁ 𝑏₂ ⟼
  𝑏₁ ∨ 𝑏₂

𝒱 (𝑥∶α) 𝑡 𝑏 ⟼
  𝑥 := 𝑡
  𝑏

These visualisations are also presented directly to the user through the use of the widgets UI framework presented in Chapter 5. The details of this visualisation are given in Section 5.8.

To summarise the roles for each constructor:

Boxes also have a set of well-formedness conditions designed to follow the typing judgements of the underlying proof-assistant development calculus. These will be developed in Section 3.4.

3.3.3. Initialising and terminating a Box

Given an expression representing a theorem statement P : Expr, ∅ ⊢ P ∶ Prop, we can initialise a box to solve P as 𝑏₀ := 𝒢 (𝑡∶P) (𝒭 𝑡) (3.11).


(3.11)
Initial 𝑏₀ : Box given P ∶ Prop.

────────
?𝑡 : P

In the case that P also depends on a context of hypotheses Γ ⊢ P ∶ Prop, these can be incorporated by prepending the initial 𝑏₀ in (3.11) with an ℐ box for each 𝑕 ∈ Γ. For example, if Γ = [(𝑥∶α), (𝑦∶β)] then send 𝑏₀ to ℐ (𝑥∶α), ℐ (𝑦∶β), 𝑏₀.

Say that a Box is solved when there are no 𝒢-binders remaining in the Box. At this point, the proving process ceases and a proof term and natural language proof may be generated.
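Initialisation and the solved check can be sketched in a few lines of Python. This is a toy model under stated assumptions: the constructors are compact namedtuple stand-ins with ASCII names (`I` for the hypothesis binder), expressions are plain strings, and `init` and `is_solved` are hypothetical helper names. The check follows the definition above literally, so a box containing a 𝒪 node counts as solved only when neither branch contains a 𝒢-binder.

```python
from collections import namedtuple

# Compact stand-ins for the six Box constructors; expressions are plain strings.
Binder = namedtuple("Binder", "name type")
I = namedtuple("I", "x b")        # hypothesis binder
G = namedtuple("G", "m b")        # goal binder
R = namedtuple("R", "r")          # result
A = namedtuple("A", "b1 r b2")    # conjunctive pair
O = namedtuple("O", "b1 b2")      # disjunctive pair
V = namedtuple("V", "x t b")      # value binder


def init(prop, context=()):
    """Build the initial box: a goal ?t : prop returning t, under one I per hypothesis."""
    box = G(Binder("t", prop), R("t"))
    for h in reversed(context):  # the first hypothesis ends up outermost
        box = I(h, box)
    return box


def is_solved(box):
    """A box is solved when no G-binder remains anywhere inside it."""
    if isinstance(box, G):
        return False
    if isinstance(box, R):
        return True
    if isinstance(box, (A, O)):
        return is_solved(box.b1) and is_solved(box.b2)
    return is_solved(box.b)  # I and V wrap a single child box


b0 = init("P", [Binder("x", "α"), Binder("y", "β")])
print(is_solved(b0))                          # False
print(is_solved(I(Binder("h", "P"), R("h")))) # True
```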

3.3.4. Transforming a Box

The aim is to solve a box through the use of a sequence of sound transformations on it. Define a box-tactic to be a partial function on boxes, BoxTactic := Box → Option Box. Box-tactics act on Boxes in the same way that tactics act on proof states. That is, they are functions which act on a proof state (i.e., a representation of an incomplete proof) in order to prove a theorem. Framing box-tactics this way makes it easier to describe how they interface with tactics in Section 3.4 and Appendix A.

In Section 3.3.1 we saw some examples of box-tactics to advance the box state and eventually complete it. A complete set of box-tactics that are implemented in the system will be given in Section 3.5.

As with tactics (at least, tactics in a 'checker' style proof assistant such as Lean; see Section 2.1 for more information), there is no guarantee that a particular box-tactic will produce a sound reasoning step; some box-tactics will be nonsense (for example, a box-tactic that simply deletes a goal) and not produce sound proofs. In Section 3.4 I will define what it means for a box-tactic to be sound and produce a correct proof that can be checked by the ITP's kernel.
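The following Python sketch shows what a box-tactic of type Box → Option Box might look like, with `None` playing the role of the failing case. Everything here is illustrative rather than HumanProof's implementation: `Imp` is a hypothetical encoding of an implication, the hypothesis name "h" is hard-coded instead of being freshly generated, and `intro`, `delete_goal` and `then` are names chosen for this example.

```python
from collections import namedtuple
from typing import Callable, Optional

Binder = namedtuple("Binder", "name type")
I = namedtuple("I", "x b")           # hypothesis binder
G = namedtuple("G", "m b")           # goal binder
R = namedtuple("R", "r")             # result
Imp = namedtuple("Imp", "lhs rhs")   # the proposition lhs → rhs (hypothetical encoding)

BoxTactic = Callable[[object], Optional[object]]  # Box → Option Box


def intro(box):
    """Turn a goal ?t : A → B into a hypothesis h : A over a goal ?t : B."""
    if isinstance(box, G) and isinstance(box.m.type, Imp):
        h = Binder("h", box.m.type.lhs)  # a real system would pick a fresh name
        return I(h, G(Binder(box.m.name, box.m.type.rhs), box.b))
    return None  # not applicable: the tactic is partial


def delete_goal(box):
    """A nonsense tactic: drops a goal binder without ever assigning it."""
    return box.b if isinstance(box, G) else None


def then(t1, t2):
    """Run t1, then t2 on its output; fail if either fails."""
    def run(box):
        mid = t1(box)
        return None if mid is None else t2(mid)
    return run


b0 = G(Binder("t", Imp("P", "Q")), R("t"))
print(intro(b0))        # I(h : P) over G(t : Q) over R(t)
print(delete_goal(b0))  # R(r='t'): no goals remain, but t was never given a value
```

The second call illustrates why soundness needs a separate definition: `delete_goal` leaves a box with no goal binders, yet its result still mentions the unassigned `t`.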

3.3.5. Relation to other development calculi

The Box calculus's design is most similar to McBride's OLEG [McB00] and G&G's prover [GG17]. A more abstract treatment can be found in the work of Sterling and Harper [SH17], implemented within the RedPRL theorem prover.

The novel contribution of the Box calculus developed here is that it works within a Spiwack-style [Spi11] flat metavariable context model as is used in Lean (see Section 2.4 for more background information). That is, it is a layer atop the existing metavariable context system detailed in Section 2.4.3. This means that it is possible for the new calculus to work alongside an existing prover, rather than having to develop an entirely new one as was required for OLEG and the G&G prover. This choice opens many possibilities: now one can leverage many of the advanced features that Lean offers such as a full-fledged modern editor and metaprogramming toolchain [EUR+17]. This approach also reduces some of the burden of correctness pressed upon alternative theorem provers, because we can outsource correctness checking to the Lean kernel. Even with this protection, it is still frustrating when a development calculus produces an incorrect proof and so I will also provide some theoretical results in Section 3.4 and Appendix A on conditions that must be met for a proof step to be sound. The design of the Box calculus is independent of any particular features of Lean, and so a variant of it may be implemented in other systems.

The central datatype is the Box. This performs the role of holding a partially constructed proof object and a representation of the goals that remain to be solved. As discussed in Section 3.1.3, the purpose is to have a structurally shared tree of goals and assumptions that is also compatible with Lean tactics.

McBride's OLEG [McB00] is the most similar to the design presented here. OLEG 'holes' are functionally the same as metavariables. That is, they are specially tagged variables that will eventually be assigned with expressions. OLEG provides an additional constructor for expressions called 'hole-bindings' or '-bindings'. Because OLEG is a ground-up implementation of a new theorem prover, hole-bindings can be added directly as constructors for expressions. This is not available in Lean without reimplementing Lean expressions and all of the algorithms. (It might be possible to use Lean's expression macro system to implement hole-bindings, but doing so would still require reimplementing a large number of type-context-centric algorithms such as unification [SB01].) These hole-bindings perform the same role as the 𝒢 constructor in that they provide the context of variables that the hole/metavariable is allowed to depend on. But if the only purpose of a hole-binding is to give a context, then why not just explicitly name that context as is done in other theorem provers? The Box architecture given above is intended to give the best of both worlds, in that you still get a shared goal-tree structure without needing to explicitly bind metavariables within the expression tree. Instead they are bound in a structure on top of it.

Lean and Coq's proof construction systems make use of the metavariable context approach outlined in Section 2.4. The metavariable context here performs the same role as the 𝒢 goal boxes, however this set of goals is flattened in to a list structure rather than stored in a tree as in Box. This makes many aspects such as unification easier but means that structural sharing (Section 3.1.3) is lost. In Section 3.4.4 I show that we do not have to forgo use of the algorithms implemented for a flat metavariable structure to use Boxes.

In Isabelle, proofs are constructed through manipulating the proof state directly through an LCF-style [Mil72] kernel of available functions (as can be seen in the source https://isabelle-dev.sketis.net/source/isabelle/browse/default/src/Pure/thm.ML). Schematic variables are used to create partially constructed terms.

Sterling and Harper [SH17] provide a category-theoretical theory of partially constructed proofs and use these principles in the implementation of RedPRL. They are motivated by the need to create a principled way of performing refinement of proofs in a dependently-typed foundation. They develop a judgement-independent framework for describing development calculi within a category-theoretical setting.

Another hierarchical proof system is HiProof [ADL10]. HiProof makes use of a tree to write proofs. The nodes of a tree are invocations of inference rules and axioms and an edge denotes the flow of evidence in the proof. These nodes may be grouped to provide varying levels of detail. These hierarchies are used to describe a proof, whereas a Box here describes a partially completed proof and a specification of hypotheses and goals that must be set to construct the proof.

3.4. Creating valid proof terms from a Box

Note that because we are using a trusted kernel, the result of producing an invalid proof with Box is a mere inconvenience because the kernel will simply reject it. However, in order for the Box structure defined in Section 3.3.2 to be useful within a proof assistant such as Lean as motivated by Section 3.1.4, it is important to make sure that a solved Box produces a valid proof for the underlying trusted kernel. To do this, I will define a typing judgement 𝑀;Γ ⊢ 𝑏 ∶ α and then present a method for extracting a proof term 𝑀;Γ ⊢ 𝑟 ∶ α from 𝑏 with the same type provided 𝑏 is solved.

3.4.1. Assignability for Box

In Section 2.4.2, I introduced the concept of an assignable datastructure for generalising variable-manipulation operations to datatypes other than expressions. We can equip a datatype containing expressions with an assignability structure assign (3.12). This is a variable-context-aware traversal over the expressions present for the datatype. For Box, this traversal amounts to traversing the expressions in each box, while adding to the local context if the subtree is below a binder. The definition of assign induces a definition of variable substitution and abstraction over Boxes.


(3.12)
Definition of assign for Box. See Section 2.4.2 for a description of assignability. The <*> operator is the applicative product for some applicative functor M (see Section 2.2.2). Note that goal 𝒢 declarations are bound, so for the purposes of assignment they are treated as simple variable binders.

assign (𝑓 : Context → Expr → M Expr) (Γ : Context)
: Box → M Box
| ℐ 𝑥 𝑏 ↦ pure ℐ <*> assign 𝑓 Γ 𝑥 <*> assign 𝑓 [..Γ, 𝑥] 𝑏
| 𝒢 𝑚 𝑏 ↦ pure 𝒢 <*> assign 𝑓 Γ 𝑚 <*> assign 𝑓 [..Γ, 𝑚] 𝑏
| 𝒭 𝑟 ↦ pure 𝒭 <*> assign 𝑓 Γ 𝑟
| 𝒜 𝑏₁ 𝑥 𝑏₂ ↦ pure 𝒜 <*> assign 𝑓 Γ 𝑏₁ <*> assign 𝑓 Γ 𝑥 <*> assign 𝑓 [..Γ, 𝑥] 𝑏₂
| 𝒪 𝑏₁ 𝑏₂ ↦ pure 𝒪 <*> assign 𝑓 Γ 𝑏₁ <*> assign 𝑓 Γ 𝑏₂
| 𝒱 𝑥 𝑡 𝑏 ↦ pure 𝒱 <*> assign 𝑓 Γ 𝑥 <*> assign 𝑓 Γ 𝑡 <*> assign 𝑓 [..Γ, 𝑥≔𝑡] 𝑏
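The shape of this traversal can be sketched in Python under a deliberate simplification: the applicative functor M is taken to be the identity, so `pure` and `<*>` disappear and `f` has type Context → Expr → Expr. The namedtuple constructors (`I` for the hypothesis binder), the string-valued expressions and the helper `assign_binder` are assumptions of this sketch, and the context entry for a value binder records only the binder, not its value.

```python
from collections import namedtuple

Binder = namedtuple("Binder", "name type")
I = namedtuple("I", "x b")        # hypothesis binder
G = namedtuple("G", "m b")        # goal binder
R = namedtuple("R", "r")          # result
A = namedtuple("A", "b1 r b2")    # conjunctive pair
O = namedtuple("O", "b1 b2")      # disjunctive pair
V = namedtuple("V", "x t b")      # value binder


def assign_binder(f, ctx, x):
    """Apply f to the type carried by a binder."""
    return Binder(x.name, f(ctx, x.type))


def assign(f, ctx, box):
    """Apply f to every expression in box, extending ctx beneath each binder."""
    if isinstance(box, I):
        return I(assign_binder(f, ctx, box.x), assign(f, ctx + [box.x], box.b))
    if isinstance(box, G):  # goals are bound, so they extend the context too
        return G(assign_binder(f, ctx, box.m), assign(f, ctx + [box.m], box.b))
    if isinstance(box, R):
        return R(f(ctx, box.r))
    if isinstance(box, A):
        return A(assign(f, ctx, box.b1), assign_binder(f, ctx, box.r),
                 assign(f, ctx + [box.r], box.b2))
    if isinstance(box, O):
        return O(assign(f, ctx, box.b1), assign(f, ctx, box.b2))
    if isinstance(box, V):
        return V(assign_binder(f, ctx, box.x), f(ctx, box.t),
                 assign(f, ctx + [box.x], box.b))
    raise TypeError(f"not a box: {box!r}")


def annotate(ctx, e):
    """Example f: tag each expression with the names in scope where it occurs."""
    return f"{e} [{','.join(b.name for b in ctx)}]"


box = I(Binder("h", "P"), G(Binder("t", "Q"), R("t")))
out = assign(annotate, [], box)
print(out.b.b.r)  # t [h,t]
```

Substitution and abstraction over boxes can then be obtained by choosing a suitable `f`, which is the sense in which the traversal induces those operations.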

3.4.2. Typing judgements for Box

In Section 2.4, I defined contexts Γ and metavariable contexts 𝑀. As covered in Carneiro's thesis [Car19], Lean's type theory affords a set of inference rules on typing judgements Γ ⊢ 𝑡 ∶ α, stating that the expression 𝑡 has the type α in the context Γ. However, these inference rules are only defined for expressions 𝑡 : Expr that do not contain metavariables. In Appendix A.1, I extend these judgements (A.10), (A.11) to also cover expressions containing metavariables, with judgements of the form 𝑀;Γ ⊢ 𝑡 ∶ α.

In a similar way, we can repeat this for Box: given contexts 𝑀 and Γ we can define a typing judgement 𝑀;Γ ⊢ 𝑏 ∶ β where 𝑏 : Box and β is a type. The inference rules for this are given in (3.13). These have been chosen to mirror the typings given in Section 2.4.3.


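(3.13)
Typing inference rules for Box. Compare with (A.10) and (A.11) in Appendix A.1.

𝑀;[..Γ, (𝑥∶α)] ⊢ 𝑏 ∶ β
──────────────────────────────────────
𝑀;Γ ⊢ (ℐ (𝑥∶α), 𝑏) ∶ (Π (𝑥∶α), β)

𝑀;Γ ⊢ 𝑡 ∶ α
──────────────────
𝑀;Γ ⊢ 𝒭 𝑡 ∶ α

[..𝑀, (?𝑥∶α)];Γ ⊢ 𝑏 ∶ β
──────────────────────────────
𝑀;Γ ⊢ (𝒢 (?𝑥∶α), 𝑏) ∶ β

𝑀;Γ ⊢ 𝑏₁ ∶ α      𝑀;[..Γ, (𝑥∶α)] ⊢ 𝑏₂ ∶ β
──────────────────────────────────────────────
𝑀;Γ ⊢ (𝒜 𝑏₁ (𝑥∶α) 𝑏₂) ∶ β

𝑀;Γ ⊢ 𝑏₁ ∶ α      𝑀;Γ ⊢ 𝑏₂ ∶ α
──────────────────────────────────
𝑀;Γ ⊢ (𝒪 𝑏₁ 𝑏₂) ∶ α

𝑀;Γ ⊢ 𝑣 ∶ α      𝑀;[..Γ, (𝑥∶α)] ⊢ 𝑏 ∶ β
────────────────────────────────────────────
𝑀;Γ ⊢ (𝒱 (𝑥∶α≔𝑣), 𝑏) ∶ β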

These typing rules have been designed to match the typing rules (A.10) of the underlying proof terms that a Box produces when solved, as I will show next.

3.4.3. Results of a Box

The structure of Box is designed to represent a partially complete expression without the use of unbound metavariables. Boxes can be converted to expressions containing unbound metavariables using results : Box → Set Expr as defined in (3.14).


Definition of results. 𝑟[𝑥] denotes a delayed abstraction (Appendix A.3.1) needed in the case that 𝑟 contains metavariables. ⦃𝑥 ↦ 𝑟⦄ 𝑏₂ denotes 𝑏₂ with 𝑟 substituted for 𝑥.

results
  : Box → Set Expr
| ℐ (𝑥∶α) 𝑏 ↦ {(Expr.λ (𝑥∶α) 𝑟[𝑥]) for 𝑟 in results 𝑏}
| 𝒢 (𝑥∶α) 𝑏 ↦ results 𝑏
| 𝒭 𝑡 ↦ {𝑡}
| 𝒜 𝑏₁ (𝑥∶α) 𝑏₂ ↦
    {𝑠 for 𝑠 in results (⦃𝑥 ↦ 𝑟⦄ 𝑏₂)
       for 𝑟 in results 𝑏₁}
| 𝒪 𝑏₁ 𝑏₂ ↦ results 𝑏₁ ∪ results 𝑏₂
| 𝒱 (𝑥∶α≔𝑣) 𝑏 ↦ {(Expr.let 𝑥 𝑣 𝑟) for 𝑟 in results 𝑏}

A 𝑏 : Box is solved when there are no remaining 𝒢 entries in it. When 𝑏 is solved, the set of results for 𝑏 does not contain any metavariables and hence can be checked by the kernel. In the case that 𝑏 is unsolved, the results of 𝑏 contain unbound metavariables. Each of these metavariables corresponds to a 𝒢-binder that needs to be assigned.

Lemma 3.15 (compatibility): Suppose that 𝑀;Γ ⊢ 𝑏 ∶ α for 𝑏 : Box as defined in (3.13). Then [..𝑀, ..goals 𝑏];Γ ⊢ 𝑟 ∶ α for each 𝑟 ∈ results 𝑏. (Say that 𝑏 is compatible with 𝑟 ∈ results 𝑏.) Here, goals 𝑏 is the set of metavariable declarations formed by accumulating all of the 𝒢-binders in 𝑏. (3.16) shows a formal statement of Lemma 3.15.


Statement of Lemma 3.15. That is, take a 𝑏 : Box and α : Expr, then if 𝑏 ∶ α in the context 𝑀;Γ and 𝑟 : Expr is a result of 𝑏 (3.14); then 𝑟 ∶ α in the context 𝑀;Γ with additional metavariables added for each of the goals in 𝑏.

  𝑀;Γ ⊢ 𝑏 ∶ α
  𝑟 ∈ results 𝑏
  ──────────────────────────────────
  [..𝑀, ..goals 𝑏];Γ ⊢ 𝑟 ∶ α

Lemma 3.15 states that given a box 𝑏 and an expression 𝑟 that is a result of 𝑏, then if 𝑏 is a valid box with type α then 𝑟 will type to α too in the metavariable context including all of the goals in 𝑏.

Lemma 3.15 is needed because it ensures that our Box will produce well-typed expressions when solved. Using Lemma 3.15, we can find box-tactics m : Box → Option Box (partial functions from Box to Box) such that 𝑀;Γ ⊢ 𝑏 ∶ α ⇒ 𝑀;Γ ⊢ m 𝑏 ∶ α whenever 𝑏 ∈ dom m. Hence a chain of such box-tactic applications will produce a result that satisfies the initial goal.

Proof: Without loss of generality, we only need to prove Lemma 3.15 for a 𝑏 : Box with no 𝒪 boxes and a single result [𝑟] = results 𝑏. To see why, note that any box containing an 𝒪 can be split as in (3.17) until each Box has one result. Then we may prove Lemma 3.15 for each of these in turn.


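results (… 𝒪 𝑏₁ 𝑏₂ …) = results (… 𝑏₁ …) ∪ results (… 𝑏₂ …)

where … stands for the same surrounding box in each term.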



Write result 𝑏 to mean this single result 𝑟. Performing induction on the typing judgements for boxes, the most difficult is 𝒜-typing, where we have to show (3.18).


The induction step that must be proven for the 𝒜-box case of Lemma 3.15.

  𝑀;Γ ⊢ 𝑏₁ ∶ α
  𝑀;[..Γ, (𝑥∶α)] ⊢ 𝑏₂ ∶ β
  𝑀';Γ ⊢ result 𝑏₁ ∶ α
  𝑀';[..Γ, (𝑥∶α)] ⊢ result 𝑏₂ ∶ β
  ──────────────────────────────────
  𝑀';Γ ⊢ result (𝒜 𝑏₁ (𝑥∶α) 𝑏₂) ∶ β

where 𝑀' := [..𝑀, ..goals (𝒜 𝑏₁ (𝑥∶α) 𝑏₂)]. To derive this it suffices to show that result is a 'substitution homomorphism':


result is a substitution homomorphism.

  𝑀;Γ ⊢ σ ok
  ──────────────────────────────────
  𝑀;Γ ⊢ σ (result 𝑏) ≡ result (σ 𝑏)

where σ is a substitution in context Γ (see Section 2.4.1; a substitution is a partial map from variables to expressions) and ≡ is the definitional equality judgement under Γ. Then we have


Here, ⦃𝑥 ↦ 𝑒⦄ 𝑏 is used to denote substitution applied to 𝑏. That is, replace each occurrence of 𝑥 in 𝑏 with 𝑒.

result (𝒜 𝑏₁ (𝑥∶α) 𝑏₂)
≡ result (⦃𝑥 ↦ result 𝑏₁⦄ 𝑏₂)
≡ ⦃𝑥 ↦ result 𝑏₁⦄ (result 𝑏₂)
≡ (λ (𝑥∶α), result 𝑏₂) (result 𝑏₁)

We can see the substitution homomorphism property of result holds by inspection on the equations of result, observing that each LHS expression behaves correctly. Here is the case for ℐ:


result and σ obey the 'substitution homomorphism' property on the case of ℐ. Here λ is used to denote the internal lambda constructor for expressions. Note here we are assuming dom σ ⊆ Γ, so 𝑥 ∉ dom σ, otherwise dom σ ⊈ Γ.

result (σ (ℐ (𝑥∶α) 𝑏))
≡ result $ ℐ (𝑥∶(σ α)) (σ 𝑏)
≡ (λ (𝑥∶(σ α)), (result (σ 𝑏))[𝑥])
≡ (λ (𝑥∶(σ α)), (σ (result 𝑏))[𝑥])  -- ∵ induction hypothesis
≡ σ (λ (𝑥∶α), (result 𝑏))
≡ σ (result (ℐ (𝑥∶α) 𝑏))

This completes the proof of Lemma 3.15. By using compatibility, we can determine whether a given box-tactic m : Box → Option Box is sound. Define a box-tactic m to be sound when for all 𝑏 ∈ dom m we have some α such that 𝑀;Γ ⊢ (m 𝑏) ∶ α whenever 𝑀;Γ ⊢ 𝑏 ∶ α.

Hence, to prove a starting proposition P (or, in general, a type α), start with an initial box 𝑏₀ := 𝒢 (?t₀∶P) (𝒭 ?t₀). Then if we only map 𝑏₀ with sound box-tactics to produce a solved box 𝑏, each of results 𝑏 has type P and hence is accepted by Lean's kernel.

Given a box-tactic m that is sound on 𝑏, we can construct a sound box-tactic on ℐ (𝑥∶α) 𝑏 too that acts on the nested box 𝑏.

3.4.4. Escape-hatch to tactics

As discussed in Section 2.4.4, many provers, including Lean 3, come with a tactic combinator language to construct proofs through mutating an object called the TacticState comprising a metavariable context and a list of metavariables called the goals. In Section 3.1 I highlighted some of the issues of this approach, but there are many built-in and community-made tactics which can still find use within a HumanProof proof. For this reason, it is important for HumanProof to provide an 'escape hatch' allowing these tactics to be used within the context of a HumanProof proof seamlessly. I achieve this compatibility system between Boxes and tactics through defining a zipper [Hue97] structure on Boxes (Appendix A.2) and then a set of operations for soundly converting an underlying TacticState to and from a Box object. The details of this mechanism can be found in Appendix A.2. It is used to implement some of the box-tactics presented next in Section 3.5, since in some cases the box-tactic is the same as its tactic-based equivalent.

3.4.5. Summary

In this section, I defined assignability on Boxes and the valid typing judgement inference rules on Box. I used these to define the soundness of a box-tactic and showed, through the use of Lemma 3.15, that for a box-tactic to be sound it suffices to show that it preserves the typing judgement. I also briefly reviewed Appendix A, which presents a mechanism for converting a tactic-style proof to a box-tactic.

3.5. Human-like-tactics for Box.

Using the framework presented above we can start defining sound tactics on Boxes and use Box to actualise the kinds of reasoning discussed in Section 3.1. Many of the box-tactics here are similar to inference rules that one would find in a usual system, and so I do not cover these ones in great detail. I also skip many of the soundness proofs, because in Appendix A I instead provide an 'escape hatch' for creating sound box-tactics from tactics in the underlying metavariable-oriented development calculus.

3.5.1. Simplifying box-tactics

We have the following box-tactics for reducing Boxes. These should be considered as tidying box-tactics.


Reduction box-tactics for Box. These are box-tactics which should always be applied if they can and act as a set of reductions to a box. Note that these are not confluent; for example 𝒪-reduce₁ and 𝒪-reduce₂ on 𝒪 (𝒭 𝑒₁) (𝒭 𝑒₂) produce different terminals.

𝒪-reduce₁ := 𝒪 (𝒭 𝑒) 𝑏₂ ⟼ 𝒭 𝑒
𝒪-reduce₂ := 𝒪 𝑏₁ (𝒭 𝑒) ⟼ 𝒭 𝑒
𝒜-reduce := 𝒜 (𝒭 𝑒) (𝑡₀ : α) 𝑏 ⟼ ⦃𝑡₀ ↦ 𝑒⦄ 𝑏
𝒢-reduce := 𝒢 (?𝑡₀ : α) (𝒭 𝑒) ⟼ 𝒭 𝑒    if ?𝑡₀ ∉ 𝑒

3.5.2. Deleting tactics

These are box-tactics that cause a Box to become simpler, but which are not always 'safe' to do, in the sense that they may lead to a Box which is impossible to solve. That is, the Box may still have a true conclusion but it is not possible to derive this from the information given in the box. For example, deleting a hypothesis 𝑝 ∶ P may prevent the goal ?𝑡 ∶ P from being solved. The rules for deletion are presented in (3.23).

To motivate 𝒪-revert tactics, recall that an 𝒪-box 𝒪 𝑏₁ 𝑏₂ represents the state that either 𝑏₁ or 𝑏₂ needs to be solved, so 𝒪-reversion amounts to throwing away one of the boxes. This is similar to 𝒪-reduce in (3.22) with the difference being that we do not need one of the boxes to be solved before applying. These are useful when it becomes clear that a particular 𝒪-branch is not solvable and can be deleted.


Deletion box-tactics. 𝒪-revert₁ and 𝒪-revert₂ take an 𝒪-box and remove one of the branches of the 𝒪-box. 𝒱-delete removes a 𝒱-box and replaces each reference to the variable bound by the 𝒱-box with its value.

𝒪-revert₁ := 𝒪 𝑏₁ 𝑏₂ ⟼ 𝑏₁
𝒪-revert₂ := 𝒪 𝑏₁ 𝑏₂ ⟼ 𝑏₂
𝒱-delete := 𝒱 (𝑥 : α := 𝑒) 𝑏 ⟼ ⦃𝑥 ↦ 𝑒⦄ 𝑏

3.5.3. Lambda introduction

In tactics, an intro tactic is used to introduce Π-binders. (A Π-binder Π (𝑥 : α), β is the dependent generalisation of the function type α → β, where the return type β may depend on the input value 𝑥 : α.) That is, if the goal state is ⊢ Π (𝑥 : α), β[𝑥] the intro tactic produces a new state (𝑥 : α) ⊢ β[𝑥]. To perform this, it assigns the goal metavariable ?t₁ : Π (𝑥 : α), β[𝑥] with the expression λ (𝑥 : α), ?t₂ where ?t₂ : β[𝑥] is the new goal metavariable with context including the additional local variable 𝑥 : α.

The intro tactic on Box is analogous, although there are some additional steps required to ensure that contexts are preserved correctly. The simplified case simple_intro (3.24) performs the same steps as the tactic version of intro.


A simple variable introduction box-tactic. Note that the new goal ?t₂ is not wrapped in a lambda abstraction because it is abstracted earlier by the ℐ box.

simple_intro :=
  ?t₁ : Π (𝑥 : α), β
  ⟼
  𝑥 : α
  ?t₂ : β

The full version (3.25) is used in the case that the 𝒢-box is not immediately followed by an 𝒭-box. In this case, a conjunctive 𝒜-box must be created in order to have a separate context for the new (𝑥 : α) variable.


The full version of the lambda introduction box-tactic. The box on the rhs of ⟼ is an 𝒜 box: 𝒜 (ℐ 𝑥, 𝒢 ?t₂, 𝒭 ?t₂) t₁ 𝑏.

intro :=
  ?t₁ : Π (𝑥 : α), β
  ⟼
  𝑥 : α
  ?t₂ : β

The fact that intro is sound follows mainly from the design of the definitions of ℐ:

Define 𝑏' to be ℐ (𝑥 : α), 𝒢 (?t₂ : β), 𝒭 ?t₂, represented graphically in (3.26). The typing judgement (3.26) follows from the typing rules (3.13).


The judgement that 𝑏' has type Π (𝑥 : α), β. β may possibly depend on 𝑥.

  ⊢ (ℐ (𝑥 : α), 𝒢 (?t₂ : β), 𝒭 ?t₂) : Π (𝑥 : α), β

By the definition of a sound box-tactic we may assume ⊢ (𝒢 ?t₁, 𝑏) : γ for some type γ. From the 𝒢 typing rule (3.13) we then have [?t₁];∅ ⊢ 𝑏 : γ. Then it follows from 𝒜 typing (3.13) that ⊢ 𝒜 𝑏' (t₁ : Π (𝑥 : α), β) 𝑏 : γ where 𝑏' := ℐ (𝑥 : α), 𝒢 (?t₂ : β), 𝒭 ?t₂.

(Structural sharing, which appears again in Section 3.5.5, is defined as making use of the same substructure multiple times in a larger structure. For example, a tree with two identical branches uses structural sharing if both branches are the same object in memory.)
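For comparison, the tactic-level behaviour that simple_intro mirrors can be checked directly in Lean 3. The following is a minimal sketch using only core Lean 3; the names α, β and h are chosen for illustration, and the snippet is not part of HumanProof itself. The second example spells out the proof term built by the first, showing the λ-abstraction that intro assigns to the goal metavariable.

```lean
-- Core Lean 3, no imports needed. Names are illustrative only.
example (α : Type) (β : α → Prop) (h : ∀ x, β x) : ∀ x : α, β x :=
begin
  intro x,    -- goal `∀ x, β x` becomes `β x`, with `x : α` added to the context
  exact h x,
end

-- The same proof as a term: the original goal is assigned `λ x, _`,
-- and the remaining hole is the new goal of type `β x`.
example (α : Type) (β : α → Prop) (h : ∀ x, β x) : ∀ x : α, β x :=
λ x, h x
```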

3.5.4. Split and cases tactics

Here I present some box-tactics for performing introduction and elimination of the ∧ type. The Box version of split performs the same operation as split in Lean: introducing a conjunction. A goal ?t₀ : P ∧ Q is replaced with a pair of new goals (?t₁, ?t₂). These can be readily generalised to other inductive datatypes with one constructor. (One caveat is that the use of ∃ requires the use of a non-constructive axiom of choice with this method. This is addressed in Section 3.5.8.) In the implementation, these are implemented using the tactic escape-hatch described in Appendix A.


Box-tactic for introducing conjunctions.

split :=
  ?t₀ : P ∧ Q
  ⟼
  ?t₁ : P
  ?t₂ : Q
  ... (⦃?t₀ ↦ ⟨?t₁, ?t₂⟩⦄ 𝑏)

Similarly we can eliminate a conjunction with cases.


Box-tactic for eliminating conjunctions. fst : P ∧ Q → P and snd : P ∧ Q → Q are the ∧-projections. In the implementation, h₀ is hidden from the visualisation to give the impression that the hypothesis h₀ has been 'split' into h₁ and h₂.

cases :=
  h₀ : P ∧ Q
  ⟼
  h₀ : P ∧ Q
  h₁ : P := fst h₀
  h₂ : Q := snd h₀
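
The tactic-level counterparts of these two box-tactics can be tried in core Lean 3. This is a small illustrative sketch, not HumanProof code; the proposition names are arbitrary. Note one difference from the box-tactic above: Lean's cases removes h₀ from the context, whereas the box version keeps h₀ and merely hides it.

```lean
-- Core Lean 3, no imports needed.
example (P Q : Prop) (h₀ : P ∧ Q) : Q ∧ P :=
begin
  cases h₀ with h₁ h₂,  -- eliminate the conjunction: h₁ : P, h₂ : Q
  split,                -- introduce the conjunction: two goals, Q and P
  { exact h₂ },
  { exact h₁ },
end

-- The same proof as a term, using the projections `.1` (fst) and `.2` (snd)
-- and the anonymous constructor for the pair of results.
example (P Q : Prop) (h₀ : P ∧ Q) : Q ∧ P :=
⟨h₀.2, h₀.1⟩
```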


3.5.5. Induction box-tactics

∧-elimination (3.28) from the previous section can be seen as a special case of induction on datatypes. Most forms of dependent type theory use inductive datatypes (see Section 2.2.3) to represent data and propositions, and use induction to eliminate them. To implement induction in CIC (the Calculus of Inductive Constructions, the foundation used by Lean 3 and Coq; see Section 2.1.3, and [Car19 §2.6] for the axiomatisation of inductive types within Lean 3's type system), each inductive datatype comes equipped with a special constant called the recursor. This paradigm broadens the use of the words 'recursion' and 'induction' to include datastructures that are not recursive.

For example, we can view conjunction A ∧ B : Prop as an inductive datatype with one constructor mk : A → B → A ∧ B. Similarly, a disjunction A ∨ B has two constructors inl : A → A ∨ B and inr : B → A ∨ B. Interpreting → as implication, we recover the basic introduction axioms for conjunction and disjunction. The eliminators for ∧ and ∨ are implemented using recursors given in (3.29).


Recursors for conjunction and disjunction.

∧-rec : ∀ (A B C : Prop), (A → B → C) → (A ∧ B) → C
∨-rec : ∀ (A B C : Prop), (A → C) → (B → C) → (A ∨ B) → C

Performing an induction step in a CIC theorem prover such as Lean amounts to the application of the relevant recursor. Case analysis on a disjunctive hypothesis makes for a good example of recursion: the recursor ∨-rec : (P → C) → (Q → C) → (P ∨ Q) → C is used. Given a box ℐ (h₀ : P ∨ Q), 𝑏 where h₀ ⊢ 𝑏 ∶ α, the ∨-cases box-tactic sends this to the box defined in (3.30). This is visualised in (3.31).


Explicit datastructure showing the resulting Box after performing ∨-cases on ℐ (h₀ : P ∨ Q), 𝑏.

𝒜 (ℐ (h₁∶P), 𝑏₁) (𝑐₁∶P → α) (
  𝒜 (ℐ (h₂∶Q), 𝑏₂) (𝑐₂∶Q → α) (
    𝒭 (∨-rec 𝑐₁ 𝑐₂ h₀)
  )
)
where 𝑏₁ := ⦃h₀ ↦ inl h₁⦄ 𝑏
      𝑏₂ := ⦃h₀ ↦ inr h₂⦄ 𝑏

Special case of recursion for eliminating ∨ statements. The right-hand side of ⟼ is simplified for the user, but is represented as a nested set of 𝒜 boxes as explicitly written in (3.30). 𝑏₁ and 𝑏₂ are defined in (3.30).

cases :=
  h₀ : P ∨ Q
  𝑏
  ⟼
  h₁ : P
  𝑏₁

  h₂ : Q
  𝑏₂

Note that the 𝑏 : Box in (3.31) may contain multiple goals. When the cases box-tactic is applied to ℐ (h₀ ∶ P ∨ Q), 𝑏, the resulting Box on the rhs of (3.31) results in two copies of these goals. This implements structural sharing of goals as motivated in Section 3.1.3. Structural sharing has a significant advantage over the goal-state style approach to tactics, where the equivalent cases tactic would have to be applied separately to each goal if there were multiple goals.
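
The correspondence between case analysis and the recursor can be seen in a small core Lean 3 example. This is an illustrative sketch only: it uses Lean's own or.rec, which plays the role of ∨-rec in (3.29), and the hypothesis names follow (3.30). The first proof uses the cases tactic; the second applies the recursor directly, which is the shape of the result expression 𝒭 (∨-rec 𝑐₁ 𝑐₂ h₀).

```lean
-- Core Lean 3, no imports needed.
example (P Q C : Prop) (c₁ : P → C) (c₂ : Q → C) (h₀ : P ∨ Q) : C :=
begin
  cases h₀ with h₁ h₂,
  { exact c₁ h₁ },   -- branch where h₀ is `or.inl h₁`
  { exact c₂ h₂ },   -- branch where h₀ is `or.inr h₂`
end

-- The same proof written as a direct application of the recursor.
example (P Q C : Prop) (c₁ : P → C) (c₂ : Q → C) (h₀ : P ∨ Q) : C :=
or.rec c₁ c₂ h₀
```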

This structurally-shared induction step also works on recursive datastructures such as lists and natural numbers. These datatypes' recursors are more complicated than non-recursive datastructures such as those in (3.29) in order to include induction hypotheses. The recursor for natural numbers is shown in (3.32). (3.33) is the corresponding box-tactic that makes use of (3.32). (3.34) is the detailed Box structure for the right-hand side of (3.33).


Recursor for natural numbers. ℕ-rec can be seen to have the same signature as mathematical induction on the natural numbers.

ℕ-rec :
  (𝒞 : ℕ → Type)                   -- motive
  → (𝒞 0)                          -- zero case
  → ((𝑖 : ℕ) → 𝒞 𝑖 → 𝒞 (𝑖 + 1))    -- successor case
  → (𝑖 : ℕ) → 𝒞 𝑖

Induction box-tactic on natural numbers. Implemented using the 'escape hatch' detailed in Appendix A. Here, α is the result type of 𝑏 (Section 3.4.2). That is, (𝑛∶ℕ) ⊢ 𝑏 ∶ α.

induction :=
  𝑛 : ℕ
  𝑏
  ⟼
  ⦃𝑛 ↦ 0⦄ 𝑏

  𝑛 : ℕ
  𝑕 : α
  ⦃𝑛 ↦ 𝑛+1⦄ 𝑏

Detail on the rhs of (3.33). The signature for ℕ-rec is given in (3.32).

𝒜 (⦃𝑛 ↦ 0⦄ 𝑏) (𝑐₁ ∶ ⦃𝑛 ↦ 0⦄ α) (
  𝒜 (ℐ (𝑛 ∶ ℕ), ℐ (𝑕 ∶ α), ⦃𝑛 ↦ 𝑛+1⦄ 𝑏) (𝑐₂ ∶ ⦃𝑛 ↦ 𝑛+1⦄ α) (
    𝒭 (ℕ-rec (𝑛 ↦ α) 𝑐₁ 𝑐₂ 𝑛)
  )
)

In general, finding the appropriate motive 𝒞 for an induction step amounts to a higher order unification problem which was shown to be undecidable [Dow01 §3]. However, in many practical cases 𝒞 can be found and higher-order provers come equipped with heuristics for these cases, an early example being Huet's semidecidable algorithm. Rather than reimplementing these heuristics, I implement induction box-tactics on Box by using the 'escape hatch' feature (Section 3.4.4).
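
As a concrete instance of the induction step, here is a minimal core Lean 3 sketch in which Lean's induction tactic finds the motive automatically. The statement 0 + n = n is chosen purely for illustration; Lean's nat.rec corresponds to ℕ-rec in (3.32), and the two goals produced correspond to the zero case and the successor case, with ih playing the role of the induction hypothesis 𝑕.

```lean
-- Core Lean 3, no imports needed.
#check @nat.rec   -- the recursor for ℕ; its arguments are the motive, zero case and successor case

example (n : ℕ) : 0 + n = n :=
begin
  induction n with k ih,
  { refl },                         -- zero case: 0 + 0 = 0
  { exact congr_arg nat.succ ih },  -- successor case, using ih : 0 + k = k
end
```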

3.5.6. Introducing 𝒪 boxes

The purpose of 𝒪 boxes is to enable backtracking and branches on Boxes that enable structural sharing. The G&G prover [GG17] takes a similar approach. For example, suppose that we had a goal x ∈ A ∪ B for some sets A, B. We might have some lemmas of the form h₁ : P → x ∈ A and h₂ : Q → x ∈ B but we are not yet sure which one to use. In a goal-based system, if you don't yet know which injection to use, you have to guess and manually backtrack. However, there may be some clues about which lemma is correct that only become apparent after applying an injection. In the above example, if only h₃ : P is present as a hypothesis, it requires first performing injection before noticing that h₁ is the correct lemma to apply. In Section 3.7.1 I discuss more advanced, critic-like workflows that 𝒪-boxes also enable.

The 𝒪 box allows us to explore various counterfactuals without having to perform any user-level backtracking (that is, having to rewrite proofs). The primitive box-tactic that creates new 𝒪-boxes is shown in (3.35). This is used to make more 'human-like' box-tactics such as ∨-split (3.36).


Box-tactic for introducing an 𝒪-box by duplication.

𝒪-intro := 𝑏 ⟼ 𝒪 𝑏 𝑏


Box-tactic for splitting a disjunctive goal into an 𝒪-box with one branch per disjunct.

∨-intro :=
  ?𝑡 : P ∨ Q
  ⟼
  𝒪 (?𝑡 : P) (?𝑡 : Q)
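
To illustrate the guess-and-backtrack problem that 𝒪-boxes avoid, the following core Lean 3 sketch restates the example above with plain propositions A and B standing in for x ∈ A and x ∈ B (an assumption made so that no set library is needed). In a goal-based script the user must commit to an injection with left or right before finding out whether the remaining goal is provable.

```lean
-- Core Lean 3, no imports needed. A and B stand in for `x ∈ A` and `x ∈ B`.
example (A B P Q : Prop) (h₁ : P → A) (h₂ : Q → B) (h₃ : P) : A ∨ B :=
begin
  left,       -- commit to the first injection; the goal is now A
  apply h₁,   -- the goal is now P
  exact h₃,
end
```

Had the script started with right instead, the goal would become B and then Q after apply h₂, which cannot be closed from h₃ : P, so the script would have to be rewritten by hand.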

3.5.7. Unification under a Box

Unification is the process of taking a pair of expressions 𝑙 𝑟 : Expr within a joint context 𝑀;Γ and finding a valid set of assignments of metavariables σ in 𝑀 such that (𝑀 + σ);Γ ⊢ 𝑙 ≡ 𝑟. Rather than develop a whole calculus of sound unification for the Box system, I can use the 'escape hatch' tactic compatibility layer developed in Appendix A to transform a sub-Box to a metavariable context and then use the theory of unification of the underlying development calculus of the theorem prover (in this case Lean). This is a reasonable approach because unifiers and matchers for theorem provers are usually very well developed in terms of both features and optimisation, so I capitalise on the unifier already present in the host proof assistant.

3.5.8. Apply

In textbook proofs of mathematics, often an application of a lemma acts under binders. For example, let's look at the application of fs 𝑛 being continuous from earlier.


An example lemma h₁ to apply. h₁ is a proof that fs 𝑛 is continuous.

h₁ :
  ∀ (𝑥 : X) (ε : ℝ) (h₀ : ε > 0),
  ∃ (δ : ℝ) (h₁ : δ > 0),
  ∀ (𝑦 : X) (h₂ : dist 𝑥 𝑦 < δ), dist (f 𝑥) (f 𝑦) < ε

In the example the application of h₁ with 𝑁, ε, h₃, and then eliminating an existential quantifier δ and then applying more arguments y, all happen in one step and without much exposition in terms of what δ depends on. A similar phenomenon occurs in backwards reasoning. If the goal is dist (f 𝑥) (f 𝑦) < ε, in proof texts the continuity of f is applied in one step to replace this goal with dist x y < δ, where δ is understood to be an 'output' of applying the continuity of f.

Contrast this with the logically equivalent Lean tactic script fragment (3.38):


A Lean tactic-mode proof fragment that is usually expressed in one step by a human, but which requires two steps in Lean. The show lines can be omitted but are provided for clarity to show the goal state before and after the obtain and apply steps. The obtain ⟨_,_,_⟩ : 𝑃 tactic creates a new goal 𝑡 : 𝑃 and, after this goal is solved, performs case-elimination on 𝑡. Here, obtain ⟨δ, δ_pos, h₁⟩ introduces δ : ℝ, δ_pos : δ > 0 and h₁ to the context.

show dist (f x) (f y) < ε,
obtain ⟨δ, δ_pos, h₁⟩ : ∃ δ, δ > 0 ∧ ∀ y, dist x y < δ → dist (f x) (f y) < ε,
  apply ‹continuous f›,
apply h₁,
show dist x y < δ,

In order to reproduce this human-like proof step, we need to develop a theory for considering 'complex applications'. A further property we desire is that results of the complex application must be stored such that we can recover a natural language write-up to explain it later (e.g., creating "Since f is continuous at x, there is some δ...").

The apply subsystem works by performing a recursive descent on the type of the assumption being applied. For example, applying the lemma given in (3.37) to a goal 𝑡 : P attempts to unify P with dist (f ?𝑥) (f ?𝑦) < ?ε with new metavariables ?𝑥 ?𝑦 : X, ?ε : ℝ. If the match is successful, it will create a new goal for each variable in a Π-binder (note that ∀ is sugar for Π) above the matching expression and a new 𝒱-binder for each introduced ∃-variable and each conjunct. These newly introduced nested boxes appear in the same order as they appear in the applied lemma.

This apply system can be used for both forwards and backwards reasoning box-tactics. The above deals with the backwards case. In the forwards case, the task is reversed, with a variable bound by a Π-binder now being the subject to match against the forwards-applied hypothesis.
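
The individual inference steps that a 'complex application' bundles together can be spelled out by hand in core Lean 3. The sketch below is illustrative only and makes two simplifying assumptions: distances and ε, δ are natural numbers, and d is an arbitrary function standing in for dist, so that no analysis library is needed. The hypothesis hf has the same binder structure as (3.37): Π-binders, then an existential with a conjunct, then more Π-binders.

```lean
-- Core Lean 3, no imports needed. `d` stands in for `dist`; ℕ stands in for ℝ.
example {X : Type} (d : X → X → ℕ) (f : X → X)
  (hf : ∀ (x : X) (ε : ℕ), ε > 0 →
          ∃ δ : ℕ, δ > 0 ∧ ∀ y : X, d x y < δ → d (f x) (f y) < ε)
  (x : X) (ε : ℕ) (hε : ε > 0) :
  ∃ δ : ℕ, δ > 0 ∧ ∀ y : X, d x y < δ → d (f x) (f y) < ε :=
begin
  cases hf x ε hε with δ hδ,   -- supply the Π-bound arguments, then eliminate ∃
  cases hδ with δ_pos h₁,      -- eliminate the conjunction
  existsi δ,
  split,
  { exact δ_pos },
  { intros y hy,
    apply h₁,                  -- backwards step: the goal becomes `d x y < δ`
    exact hy },
end
```

Each cases line corresponds to one of the 𝒱-binders described above (the ∃-variable δ and the conjunct δ > 0), and the final apply corresponds to the new goal created for the remaining Π-binder.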

An example of applying (3.37) to the goal dist (f x) (f y) < ε can be seen in (3.39).


An example of applying (3.37) to t₁. It produces a set of nested goals in accordance with the structure of the binders in (3.37). Result Boxes are omitted.

apply ‹continuous f› :
  𝑥 𝑦 : X
  ε : ℝ
  ?t₁ : dist (f 𝑥) (f 𝑦) < ε
  ⟼
  𝑥 𝑦 : X
  ε : ℝ
  ?t₂ : ε > 0
  δ : ℝ := _
  h₂ : δ > 0 := _
  ?t₃ : dist 𝑥 𝑦 < δ

A note on using apply with existential statements

One complication with this approach to apply is performing many logical inference steps when applying a lemma in one go. There is a technical caveat with applications of existential statements such as ∃ (δ : ℝ), d(𝑥, 𝑦) < δ: by default, Lean is a non-classical theorem prover, which here amounts to saying that the axiom of choice is not assumed automatically. Without the axiom of choice, it is not generally possible to construct a projection function ε : (∃ (𝑥 : α), P [𝑥]) → α such that P[ε ℎ] is true for all ℎ : ∃ (𝑥 : α), P. There are two ways to overcome this limitation:

  1. Assume the axiom of choice and make use of the nonconstructive projector ε.

  2. When an apply step encounters an existential quantifier, wrap the entire proof in an existential quantifier recursor (recursors are discussed in Section 3.5.5) ∃-rec (C : Prop) : (∀ (𝑥 : α), P 𝑥 → C) → (∃ (𝑥 : α), P 𝑥) → C using 𝒜-boxes. This is performed in exactly the same manner that induction box-tactics are applied in Section 3.5.5.

HumanProof, as it is currently implemented, uses strategy 1. This prevents proofs from being constructive, but is otherwise not so great a concession, since mathematicians regularly make use of this in fields outside logic. There was some effort to also implement strategy 2, but I dropped it.
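
The two strategies can be contrasted in core Lean 3. This is a minimal sketch rather than HumanProof's implementation: witness and witness_spec are hypothetical names introduced here, built on Lean's classical.some and classical.some_spec, which provide the nonconstructive projector written ε above. The last example uses exists.elim, Lean's counterpart of ∃-rec.

```lean
-- Core Lean 3, no imports needed. `witness` and `witness_spec` are illustrative names.

-- Strategy 1: a nonconstructive projection out of an existential statement.
noncomputable def witness {α : Type} {P : α → Prop} (h : ∃ x, P x) : α :=
classical.some h

lemma witness_spec {α : Type} {P : α → Prop} (h : ∃ x, P x) : P (witness h) :=
classical.some_spec h

-- Strategy 2: wrap the remainder of the proof `k` in the existential recursor.
example {α : Type} {P : α → Prop} {C : Prop}
  (h : ∃ x, P x) (k : ∀ x, P x → C) : C :=
exists.elim h k
```

Strategy 1 lets the witness be referred to anywhere in the surrounding box, at the cost of a noncomputable definition; strategy 2 stays constructive but only works when the overall conclusion C is a proposition.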

3.5.9. Summary

This section introduced a set of sound box-tactics that are implemented for the HumanProof system. In the next section we will see how these box-tactics can be used to create natural language write-ups of proofs.

3.6. Natural language generation of proofs

In this section I detail how the above box architecture is used to produce natural language writeups as the proof progresses. The general field is known as Natural Language Generation (NLG). You can find a background survey of NLG both broadly and within the context of generating proofs in Section 2.7.

Here I lean on the work of Ganesalingam, who in his thesis [Gan10] has specified a working theory of the linguistics of natural language mathematics. As well as generating a formally verifiable result of a proof, I also extend G&G by providing some new mechanisms for converting Lean predicates and typeclasses into English language sentences. That is, in the implementation of the G&G theorem prover, many natural language constructs such as "X is a metric space" were hard-coded into the system. In this work I provide a general framework for attaching verbalisations of these kinds of statements to typeclasses and predicates within Lean. I also make the resulting write-up interactive: a partial proof write-up is emitted if the proof state is not yet solved, and the natural language write-up can be inspected through the widgets system. In contrast, G&G's output was a static file.

The goal of this section is to demonstrate that the Box architecture above is representative of human-like reasoning by constructing natural language writeups of the proofs created using Boxes. As such the NLG used here is very simple compared to the state of the art and doesn't make use of any modern techniques such as deep learning. The output of this system is evaluated by real, human mathematicians in Chapter 6. An example of a proof generated by the system is shown below in Output 3.40. There are some challenges in converting a Box proof to something that reads like a mathematical proof that I will detail here.

Output 3.40

Output from the HumanProof natural language write-up system for a proof that the composition of continuous functions is continuous.

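Let X, Y and Z be metric spaces, let f be a function X → Y and let g be a function Y → Z. Suppose f is continuous and g is continuous. We need to show that g ∘ f is continuous. Let ε > 0 and let x ∈ X. We must choose δ > 0 such that ∀ (y : X), dist x y < δ ⇒ dist ((g ∘ f) x) ((g ∘ f) y) < ε. Since g is continuous, there exists a η > 0 such that dist (g (f x)) (g (f y)) < ε whenever dist (f x) (f y) < η. Since f is continuous, there exists a θ > 0 such that dist (f x) (f y) < η whenever dist x y < θ. Since dist x y < δ, we are done by choosing δ to be θ.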

3.6.1. Overview

The architecture of the NLG component is given in Figure 3.41. The design is similar to the standard architecture discussed in Section 2.7.1. In Section 3.1.2 I explained the decision to design the system to permit only a restricted set of box-tactics on a Box representing the goal state of the prover. To create the natural language write-up from these box-tactics, each box-tactic also emits an Act object. This is an inductive datatype representing the kind of box-tactic that occurred. So for example, there is an Intro : List Binder → Act that is emitted whenever the intro box-tactic is performed, storing the list of binders that were introduced. A list of Acts is held in the state monad for the interactive proof session. This list of acts is then fed to a micro-planner, which converts the list of acts to an abstract representation of sentences (sometimes referred to as a phrase specification). These sentences are converted to a realised sentence with the help of Run, which is a form of S-expression [McC60] containing text and expressions for interactive formatting. This natural language proof is then rendered in the output window using the widgets system (Chapter 5).
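
The flow from acts to text can be sketched in a few lines of core Lean 3. Everything in this sketch is a hypothetical simplification introduced for illustration: the act type below has only three constructors carrying plain strings, the sentence templates are invented, and the micro-planning and Run stages are collapsed into a single string-producing function. It is not HumanProof's actual Act datatype or planner.

```lean
-- Core Lean 3, no imports needed. Hypothetical, simplified stand-ins only.
inductive act
| intro (names : list string) : act   -- variables introduced by an intro step
| apply_fact (fact : string) : act    -- a lemma or hypothesis that was applied
| finish : act                        -- the box has been solved

-- One invented sentence template per kind of act.
def act.to_sentence : act → string
| (act.intro xs)     := "Let " ++ string.intercalate " and " xs ++ " be given."
| (act.apply_fact h) := "Since " ++ h ++ ", the goal follows."
| act.finish         := "We are done."

-- The write-up is the list of sentences joined with spaces.
def write_up (acts : list act) : string :=
string.intercalate " " (acts.map act.to_sentence)

#eval write_up [act.intro ["ε", "x"], act.apply_fact "f is continuous", act.finish]
```

Keeping the acts as data, rather than emitting text directly from each box-tactic, is what allows a later planning stage to merge or reorder sentences before they are realised.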

Figure 3.41

Overview of the pipeline for the NLG component of HumanProof. A Box has a series of box-tactics performed upon it, each producing an instance of Act, an abstract representation of what the box-tactic did. A list of all of the Acts from the session is then converted in to a list of sentences, which is finally converted to an S-expression-like structure called Run. Compare this with the standard architecture given in Figure 2.41; the main difference being that the macroplanning phase is performed by the choice of box-tactics performed on boxes as detailed in Section 3.5.

3.6.2. Grice's laws of implicature

One resource that has proven useful in creating human-like proofs is the work of Grice on implicature in linguistics [Gri75]. To review, Grice states that there is an unwritten rule in natural languages that one should only provide as much detail as is needed to convey the desired message. For example, the statement "I can't find my keys" has the implicature "Do you know where my keys are?"; it implies that the keys may have been lost at the current location and not in a different part of town, and so on. If superfluous detail is included, the reader will pick this up and try to use it to infer additional information. Saying "I can't find my keys at the moment", interpreted literally, has the same meaning as "I can't find my keys", but implicitly means that I have only just realised the key loss or that I will be able to find them soon. Grice posits four maxims that should be maintained in order for a sentence or phrase to be useful:

  1. Quantity The contribution should contain no more or less than what is required. Examples: "Since and is prime, ". "Let be a positive real such that ."

  2. Quality Do not say things for which you don't have enough evidence or things that are not true. An example here would be a false proof.

  3. Relation The contributed sentence should be related to the task at hand. Example: putting a true but irrelevant statement in the middle of the proof is bad.

  4. Manner The message should avoid being obscure, ambiguous and long-winded.

Mathematical texts are shielded from the more oblique forms of implicature that may be found in general texts, but Grice's maxims are still important to consider in the construction of human-readable proofs and serve as a useful rule-of-thumb in determining when a generated sentence will be jarring to read.

With respect to the quantity maxim, it is important to remember also that what counts as superfluous detail can depend on the context of the problem and the skill-level of the reader. For example, one may write:

Suppose and are open subsets of . Since is continuous, is open.

A more introductory text will need to also mention that is a topological space and so is open. Generally these kinds of implicit lemma-chaining can become arbitrarily complex, but it is typically assumed that these implicit applications are entirely familiar to the reader. Mapping ability level to detail is not a model that I will attempt to write explicitly here. One simple way around this is to allow the proof creator to explicitly tag steps in the proof as 'trivial' so that their application is suppressed in the natural language write-up. Determining this correct level of detail may be a problem in which ML models may have a role to play.

3.6.3. Microplanning symbolic mathematics

From a linguistic perspective, a remarkable property of mathematical texts is the interlacing of mathematical symbols and natural language. In the vast majority of cases, each symbolic construct has a natural language equivalent (else verbalising that symbol in conversation would be difficult). For example: "𝑥 + 𝑦" versus "𝑥 plus 𝑦". Sometimes multiple verbalisations are possible: P ⇒ Q can be "P implies Q" or "Q whenever P". Sometimes the symbolic form of a statement is not used as frequently: "𝑝 is prime" versus a symbolic rendering of the same predicate. In making text flow well, the decision of when to move between symbolic and textual renderings of a mathematical proof is important. The rule-of-thumb that I have arrived at is to render the general flow of the proof's reasoning using text and to render the objects that are being reasoned about using symbols. The idea here is that one should be able to follow the rough course of argument whilst only skimming the symbolic parts of the proof.

3.6.4. Microplanning binders with class predicate collections

In mathematics, it is common that a variable will be introduced in a sentence and then referenced in later sentences. For example, one will often read sentences such as "Let X be a metric space and let x and y be points in X". This corresponds to the following telescope of binders: (X : Type) (_ : metric_space X) (x y : X). Such sentences effectively act as 'linguistic variable binders'. (A telescope is a list of binders where the type of a binder may depend on variables declared earlier in the list. Telescopes are equivalent to a well-formed context (see Section 2.1.3), but the term telescope is also used to discuss lists of binders that appear in expressions such as lambda and forall bindings.)

In this subsection I will highlight how to convert lists of binders to natural language phrases of this form. To the best of my knowledge this is an original contribution so I will explain this mechanism in more detail. This approach is inspired by the idea of 'notions' as first used in the ForTheL controlled natural language parser for the SAD project [VLP07, Pas07, VLPA08], also used by Naproche/SAD [DKL20]. Ganesalingam [Gan10] refers to these as non-extensional types and Ranta [Ran94] as syntactic categories. The PROVERB system [HF97] and the G&G system [GG17] provide a mechanism for generating natural language texts using a similar technique for aggregating assumptions; however, these approaches do not allow for the handling of more complex telescopes found in dependent type theory. Table 3.42 presents some examples of the kinds of translations in question.

Table 3.42

Examples of generating natural language renderings of variable introductions from type-theory telescopes. Square brackets on a binder such as [group G] denote a typeclass binder. This typeclass binder is equivalent to the binder (𝔤 : group G) where the binder name 𝔤 is omitted. Typeclasses were first introduced by Hall et al. for use with the Haskell programming language [HHPW96]. Typeclasses are used extensively in the Lean 3 theorem prover. A description of their implementation can be found in [MAKR15 §2.4].

| Telescope | Generated text |
|---|---|
| (X : Type) [metric_space X] (𝑥 𝑦 : X) | Let X be a metric space and let 𝑥 and 𝑦 be points in X. |
| (G : Type) [group G] (𝑥 𝑦 : G) | Let G be a group and let 𝑥 and 𝑦 be elements of G. |
| (G : Type) [group G] (H : set G) (h₁ : subgroup.normal G H) | Let G be a group and H be a normal subgroup of G. |
| (𝑎 𝑏 : ℤ) (h₁ : coprime 𝑎 𝑏) | Let 𝑎 and 𝑏 be coprime integers. |
| (𝑓 : X → Y) (h₁ : continuous 𝑓) | Let 𝑓 : X → Y be a continuous function. |
| (T : Type) [topological_space T] (U : set T) (h₁ : open U) | Let T be a topological space and let U be an open set in T. |
| (ε : ℝ) (h₁ : ε > 0) | Let ε > 0. |

[HHPW96] Hall, Cordelia V; Hammond, Kevin; Peyton Jones, Simon L; et al. Type classes in Haskell (1996). ACM Transactions on Programming Languages and Systems (TOPLAS).
[MAKR15] de Moura, Leonardo; Avigad, Jeremy; Kong, Soonho; et al. Elaboration in Dependent Type Theory (2015). CoRR.

The variable introduction sentences in Table 3.42 take the role of a variable binder for mathematical discourse. This variable is then implicitly 'in scope' until its last mention in the text. Some variables introduced in this way can remain in scope for an entire book, for example the choice of underlying field k in a book on linear algebra. As Ganesalingam notes [Gan10 §2.5.2], "If mathematicians were not able to use variables in this way, they would need to write extremely long sentences!"

Let's frame the problem as follows: take as input a telescope of binders (e.g., [(𝑎 : ℤ), (𝑏 : ℤ), (h₁ : coprime 𝑎 𝑏)]) and produce a 'variable introduction text' string as shown in the above table. The problem involves a number of challenges:

To solve this I introduce a schema of class predicate collections. Each binder in the input telescope is converted to two pieces of data: the subject expression 𝑥 and the class predicate 𝑐𝑝, which is made from one of the following constructors.

The subject expression and the class predicate for a given binder in the input telescope are assigned by consulting a lookup table which pattern-matches the binder type expressions to determine the subject expression and any additional parameters (for example T in "open set in T"). Each pair ⟨𝑥, 𝑐𝑝⟩ is mapped to ⟨[𝑥], [𝑐𝑝]⟩ : List Expr × List ClassPredicate. I call this a class predicate collection (CPC). The resulting list of CPCs is then reduced by aggregating [DH93] adjacent pairs of CPCs according to (3.43).


Rules for aggregating class predicate collections.

⟨𝑥𝑠, 𝑐𝑝𝑠⟩, ⟨𝑦𝑠, 𝑐𝑝𝑠⟩ ↝ ⟨𝑥𝑠 ++ 𝑦𝑠, 𝑐𝑝𝑠⟩
⟨𝑥𝑠, 𝑐𝑝𝑠₁⟩, ⟨𝑥𝑠, 𝑐𝑝𝑠₂⟩ ↝ ⟨𝑥𝑠, 𝑐𝑝𝑠₁ ++ 𝑐𝑝𝑠₂⟩
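The two aggregation rules can be illustrated with a small Python sketch. It is not the HumanProof implementation: subjects and class predicates are modelled as plain strings, and the names `merge` and `aggregate` are hypothetical. Adjacent collections are merged left to right whenever one of the two rules applies.

```python
from typing import List, Optional, Tuple

# A class predicate collection: (subjects, class predicates).
# Both are modelled as lists of strings purely for illustration.
CPC = Tuple[List[str], List[str]]


def merge(left: CPC, right: CPC) -> Optional[CPC]:
    """Apply one of the two aggregation rules to an adjacent pair, if possible."""
    xs, cps1 = left
    ys, cps2 = right
    if cps1 == cps2:   # same predicates: concatenate the subjects
        return (xs + ys, cps1)
    if xs == ys:       # same subjects: concatenate the predicates
        return (xs, cps1 + cps2)
    return None


def aggregate(cpcs: List[CPC]) -> List[CPC]:
    """Reduce a list of CPCs by repeatedly merging adjacent pairs."""
    out: List[CPC] = []
    for cpc in cpcs:
        merged = merge(out[-1], cpc) if out else None
        if merged is not None:
            out[-1] = merged
        else:
            out.append(cpc)
    return out


# The telescope (a b : ℤ) (h₁ : coprime a b), one CPC per binder.
coprime_telescope = [
    (["a"], ["integer"]),
    (["b"], ["integer"]),
    (["a", "b"], ["coprime"]),
]
print(aggregate(coprime_telescope))
```

In this example the first two collections share a predicate and so merge their subjects; the third then shares the merged subject list and contributes its predicate, leaving a single collection that could be realised as "Let 𝑎 and 𝑏 be coprime integers".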

In certain cases, the merging operation can also delete class predicates that are superseded by later ones. An example is that if we have (𝑥 : X) (h₁ : 𝑥 ∈ A), this can be condensed directly to ⟨[𝑥], [symbolic_postfix "∈ A"]⟩ which realises to "Let 𝑥 ∈ A" instead of the redundant "Let 𝑥 ∈ A be an element of X" which violates Grice's maxim of quantity (Section 3.6.2).

Additionally, the resulting class predicate collection list is partitioned into two lists so that only the first mention of each subject appears in the first list. For example, 𝑥 : X and h : 𝑥 ∈ A both have the subject 𝑥, but rendering both as introductions, as in "Let 𝑥 be a point and let 𝑥 ∈ A", would introduce 𝑥 twice.

These class predicate collections can then be realised for a number of binder cases:

class_noun can be compared to the concept of a 'notion' in ForTheL and Naproche/SAD and a 'non-extensional type' in Ganesalingam [Gan10]. It takes the role of a noun that the introduced variable belongs to, and is usually preceded with an indefinite article: "let 𝑥 be an element of G".

Will some mechanism like CPCs be necessary in the future, or are they a cultural artefact of the way that mathematics has been done in the past? When designing mathematical definitions in formalised mathematics, one must often make a choice in how new datatypes are defined: should there be a new type 'positive real number' or just use the real numbers and add a hypothesis ε > 0? In natural language mathematics, one is free to move between these bundled and unbundled representations without concern. The CPC structure reflects this; "ε is a positive real" can be interpreted as either a "real that is positive" or as a single semantic type "positive real". Natural mathematics does not disambiguate between these two because they are both equivalent within its informal rules, similar to how the representation of 𝑎 + 𝑏 + 𝑐 does not need to disambiguate between (𝑎 + 𝑏) + 𝑐 and 𝑎 + (𝑏 + 𝑐) since they are equal.

3.6.5. Handling 'multi-apply' steps

The specialised apply box-tactic discussed in Section 3.5.8 requires some additional processing. The apply box-tactic returns a datatype called ApplyTree that indicates how a given lemma was applied, resulting in parameters, goals and values obtained through eliminating an existential statement. These are converted into "since" sentences:

"Since f is continuous, there exists some δ > 0 such that d (f 𝑥) (f 𝑦) < ε whenever d 𝑥 𝑦 < δ"

The code that produces this style of reasoning breaks down into a Reason component indicating where the fact came from and a restatement of the fact with the parameters set to be relevant to the goal. In most cases, the Reason can simply be a restatement of the fact being used. However, it is also possible to produce more elaborate reasons. For example, applying a hypothesis will also match the preconditions of that hypothesis if they appear in context. That is, if h₀ : P → Q, then apply h₀ in the box in (3.44) will automatically include the propositional assumption h₁ : P to solve the Box, instead of resulting in a new goal ?t : P. This will produce the reason "Since P → Q and P, we have Q".

h₀ : P → Q
h₁ : P

?t : Q

3.6.6. Multiple cases

Some problems branch into multiple cases. For example, the AB problem. Here, some additional macroplanning needs to occur, since it usually makes sense to place each of the cases in their own paragraph. When cases is performed, the resulting 𝒜-box contains two separate branches for each case as discussed in (3.28).

When a box-tactic is performed to create an Act, and that box-tactic is performed within one of these case blocks, the Act is tagged with the case. This tag is then used to partition the resulting rendered string into multiple paragraphs.

3.6.7. Realisation

As shown in Figure 3.41, the set of Acts is compiled to a sequence of Sentence objects and these are converted to a run of text. As detailed in Section 2.7 this last step is called realisation. In the realisation phase, each sentence is converted to a piece of text containing embedded mathematics. Each statement is constructed through recursively assembling canned phrases representing each sentence. This means that longer proofs can become monotonous but the application of synonymous phrases could be used to add variation. However, the purpose of this NLG system is to produce 'human-like' reasoning and so if the proofs read as too monotonous, it suggests that less detail should have been included in the Act list structure.

When realising logical statements, the prose would become unnatural or ambiguous after a certain depth. After a depth of two these statements switch to being entirely symbolic. For example, (P → Q) → X → Y would naïvely render recursively in natural language as "Y whenever X and Q whenever P". Even with some more sophisticated algorithm to remove the clunkiness, writing "Y whenever X and P → Q" is just much clearer.

Mathematical expressions were pretty printed using Lean's pretty printing engine. However, the Lean 3 pretty printer needs a metavariable context in order to render, so it was necessary to add a tactic state object alongside the Act objects. It was necessary to store this context separately for each act because some metavariables would become solved through the course of the proof and cause confusing statements such as "by setting ε to be ε", where it should read "by setting η to be ε". Another printing issue was in the printing of values created through destructuring existential variables, which would be rendered as classical.some.

3.6.8. Summary

In this section, I detailed the workings of the natural language write-up component of the HumanProof system. I gave an overview of the standard architecture pipeline and then discussed the areas of novelty, namely the approach to producing suitable noun-phrase strings from type-theoretical telescopes and the verbalisation of multi-apply steps.

3.7. Conclusion

In this chapter, I have introduced a new Box development calculus for human-like reasoning and demonstrated its compatibility (Section 3.4, Appendix A) with the development calculus of the Lean theorem prover. I have outlined the structure of a set of box-tactics within this calculus that allow for the creation of both formal and natural-language proofs of this output.

I then detailed the natural language generation component of HumanProof. The component can produce readable proofs of simple lemmas. Supporting larger projects is left for future work.

In the next chapter, we will discuss a new component to enhance the Box system for use with equational reasoning. I will make use of the work presented in this chapter in the evaluation (Chapter 6).

I will finish this chapter with some thoughts on future directions for the Box datastructure. A more general outlook on future work can be found in Section 7.2, where I also discuss potential future directions in applying deep learning to natural language generation.

3.7.1. Future work: 𝒪-critics

An avenue for future research is the definition of some additional box-tactics for the Box datastructure that allow it to work in a similar fashion to Ireland's proof critics [Ire92]. Recall from Section 2.6.2 that proof critics (broadly speaking) are a proof planning technique that can revise a proof plan in light of information gained from executing a failed plan. 𝒪-boxes can support a similar idea, as I will now exemplify in (3.45), where the statement to prove is ∀ 𝑎 𝑏 : ℝ, ∃ 𝑥 : ℝ, (𝑎 ≤ 𝑥) ∧ (𝑏 ≤ 𝑥). The proof requires spotting the trichotomy property of real numbers: ∀ 𝑥 𝑦 : ℝ, 𝑥 ≤ 𝑦 ∨ 𝑦 < 𝑥. However, it is difficult to see from the goal state whether this will apply.


Sketch of some future work making use of 𝒪-boxes to perform a speculative application of the lemma 𝑎 = 𝑥 → 𝑎 ≤ 𝑥 (highlighted). The box-tactics are: 𝒪-intro (3.36); apply 𝑎 = 𝑥 → 𝑎 ≤ 𝑥 to the left instance of ?𝑡₁; apply reflexivity to the left ?𝑡₁, causing 𝑎 and ?𝑥 to be unified (see Section 3.5.7).

𝑎 𝑏 : ℝ

?𝑥 : ℝ
?𝑡₁ : 𝑎 ≤ ?𝑥
?𝑡₂ : 𝑏 ≤ ?𝑥

𝑎 𝑏 : ℝ

?𝑥 : ℝ
?𝑡₁ : 𝑎 ≤ ?𝑥
?𝑡₂ : 𝑏 ≤ ?𝑥

?𝑥 : ℝ
?𝑡₁ : 𝑎 ≤ ?𝑥
?𝑡₂ : 𝑏 ≤ ?𝑥

𝑎 𝑏 : ℝ

?𝑥 : ℝ
?𝑡₁ : 𝑎 = ?𝑥
?𝑡₂ : 𝑏 ≤ ?𝑥

?𝑥 : ℝ
?𝑡₁ : 𝑎 ≤ ?𝑥
?𝑡₂ : 𝑏 ≤ ?𝑥

𝑎 𝑏 : ℝ

?𝑡₂ : 𝑏 ≤ 𝑎

?𝑥 : ℝ
?𝑡₁ : 𝑎 ≤ ?𝑥
?𝑡₂ : 𝑏 ≤ ?𝑥

At this point, one can spot that the left-hand box is no longer possible to solve unless one assumes 𝑏 ≤ 𝑎. However, rather than deleting the left-hand box, we can instead use this information as in (3.46).


Continuation of (3.45) to perform an 'informed backtracking'. The key step is the inclusion of an instance of the LEM axiom, triggered by the insolubility of the goal ?𝑡₂ : 𝑏 ≤ 𝑎 on the left-hand branch of the 𝒪-box. The next step is an amalgamation of two box-tactics: ∨-cases (3.31) and 𝒪-hoisting (A.42) as described in Definition A.39. This is followed by an application of the case hypothesis in the left-hand box and 𝒪-reduce (3.22), then an application of ¬(𝑏 ≤ 𝑎) → 𝑎 ≤ 𝑏, and finally an application of 𝑏 ≤ 𝑏.

𝑎 𝑏 : ℝ

?𝑡₂ : 𝑏 ≤ 𝑎

?𝑥 : ℝ
?𝑡₁ : 𝑎 ≤ ?𝑥
?𝑡₂ : 𝑏 ≤ ?𝑥

𝑎 𝑏 : ℝ
ℎ : 𝑏 ≤ 𝑎 ∨ ¬ 𝑏 ≤ 𝑎

?𝑡₂ : 𝑏 ≤ 𝑎

?𝑥 : ℝ
?𝑡₁ : 𝑎 ≤ ?𝑥
?𝑡₂ : 𝑏 ≤ ?𝑥

𝑎 𝑏 : ℝ

ℎ : 𝑏 ≤ 𝑎

?𝑡₂ : 𝑏 ≤ 𝑎
ℎ : 𝑎 < 𝑏

?𝑥 : ℝ
?𝑡₁ : 𝑎 ≤ ?𝑥
?𝑡₂ : 𝑏 ≤ ?𝑥

𝑎 𝑏 : ℝ
ℎ : 𝑎 < 𝑏

?𝑥 : ℝ
?𝑡₁ : 𝑎 ≤ ?𝑥
?𝑡₂ : 𝑏 ≤ ?𝑥

𝑎 𝑏 : ℝ
ℎ : 𝑎 < 𝑏

?𝑡₂ : 𝑏 ≤ 𝑏

The remaining research question for putting (3.45) and (3.46) into practice is to determine some heuristics for when it is appropriate to perform 𝒪-intro and the step in which an instance of 𝑏 ≤ 𝑎 ∨ ¬ 𝑏 ≤ 𝑎 is introduced. What is an appropriate trigger for suggesting this manoeuvre to the user?

Chapter 4

4.1. Equational reasoning

Equality chains are ubiquitous in mathematics. Here, by equality chain, I mean parts of proofs found in mathematical literature that consist of a list of expressions separated by = or some other transitive relation symbol. The chains are chosen such that each pair of adjacent expressions is clearly equal to the reader, in the sense that the equality does not need to be explicitly justified. Hence, by transitivity, the chain shows that the first expression is equal to the last one.

For example, take some vector space V. Suppose that one wishes to prove that given a linear map 𝐴 : V → V, its adjoint 𝐴† : V → V is linear. (In general, the adjoint should act on the dual space, 𝐴† : V* → V*.) To do so one typically provides the equality chain (4.1) for all vectors 𝑥 𝑢 𝑣 : V.


The running example problem for this chapter. Here, ⟨_, _⟩ : V × V → ℂ is the inner product taking a pair of vectors to a complex number.

⟨𝐴†(𝑢 + 𝑣), 𝑥⟩
= ⟨𝑢 + 𝑣, 𝐴 𝑥⟩
= ⟨𝑢, 𝐴 𝑥⟩ + ⟨𝑣, 𝐴 𝑥⟩
= ⟨𝐴†𝑢, 𝑥⟩ + ⟨𝑣, 𝐴 𝑥⟩
= ⟨𝐴†𝑢, 𝑥⟩ + ⟨𝐴†𝑣, 𝑥⟩
= ⟨𝐴†𝑢 + 𝐴†𝑣, 𝑥⟩

The equations that one can compose the reasoning chain from (e.g., ⟨𝐴†𝑎, 𝑏⟩ = ⟨𝑎, 𝐴 𝑏⟩) are called rewrite rules. For the example (4.1), there are a large number of axiomatic rewrite rules available (4.2) and still more rules derived from these. We can formulate the equation rewriting problem for two expressions Γ ⊢ 𝑙 = 𝑟 as finding a path in the graph E whose vertices are expressions in Γ and whose edges are generated by a set of rewrite rules 𝑅 (such as those in (4.2)). Any free variables in 𝑅 are substituted with correctly typed expressions to produce ground rewrite rules that are then closed under symmetry, transitivity and congruence. (A relation ~ is congruent when 𝑥 ~ 𝑦 implies 𝑡⦃𝑧 ↦ 𝑥⦄ ~ 𝑡⦃𝑧 ↦ 𝑦⦄ for all valid expressions 𝑥, 𝑦 and 𝑡 where 𝑡 has a free variable 𝑧.)
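To make the graph formulation concrete, the following Python sketch performs a breadth-first search for a path in a small fragment of E. It is a simplification under stated assumptions: expressions are plain strings, the ground rewrite rules are given as pairs of whole expressions (so the congruence closure is assumed to have been applied already), and the function name `find_chain` and the ASCII spelling `adj A` for the adjoint are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def find_chain(lhs: str, rhs: str,
               ground_rules: List[Tuple[str, str]]) -> Optional[List[str]]:
    """Breadth-first search for an equality chain from lhs to rhs."""
    # Close the ground rules under symmetry: each rule gives two edges.
    edges: Dict[str, List[str]] = {}
    for a, b in ground_rules:
        edges.setdefault(a, []).append(b)
        edges.setdefault(b, []).append(a)

    parent: Dict[str, Optional[str]] = {lhs: None}
    queue = deque([lhs])
    while queue:
        e = queue.popleft()
        if e == rhs:
            # Walk the parent pointers back to lhs to recover the chain.
            chain = []
            cur: Optional[str] = e
            while cur is not None:
                chain.append(cur)
                cur = parent[cur]
            return chain[::-1]
        for nxt in edges.get(e, []):
            if nxt not in parent:
                parent[nxt] = e
                queue.append(nxt)
    return None


e0 = "<adj A (u + v), x>"
e1 = "<u + v, A x>"
e2 = "<u, A x> + <v, A x>"
e3 = "<adj A u, x> + <v, A x>"
e4 = "<adj A u, x> + <adj A v, x>"
e5 = "<adj A u + adj A v, x>"
rules = [
    (e0, e1), (e1, e2), (e2, e3), (e3, e4),
    (e5, e4),                           # stated right-to-left; symmetry closes it
    (e2, "<v, A x> + <u, A x>"),        # a distracting commutativity step
]
for step in find_chain(e0, e5, rules) or []:
    print(step)
```

Transitivity corresponds to following several edges in sequence. With only a handful of ground rules the search is trivial; the difficulty discussed in this chapter arises because the real graph has very many edges leaving each expression.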


A possible set of rewrite rules relevant to (4.1), where 𝑎 𝑏 𝑐 : ℂ for some field ℂ; 𝑥 𝑦 𝑧 : V for some ℂ-vector space V; and 𝐴 : V → V is a linear map on V. Note that the vector space laws are over an arbitrary vector space and so can also apply to the dual space V*. This list is for illustrative purposes rather than being exhaustive: the details within an ITP can vary. For example, in Lean, there is a single commutativity rule ∀ (𝑎 𝑏 : α) [comm_monoid α], 𝑎 * 𝑏 = 𝑏 * 𝑎 which applies to any type with an instance of the comm_monoid typeclass.

𝑎 + 𝑏 = 𝑏 + 𝑎
𝑎 + (𝑏 + 𝑐) = (𝑎 + 𝑏) + 𝑐
0 + 𝑎 = 𝑎
𝑎 + - 𝑎 = 0
- 𝑎 + 𝑎 = 0
𝑎 * 𝑏 = 𝑏 * 𝑎
𝑎 * (𝑏 * 𝑐) = (𝑎 * 𝑏) * 𝑐
1 * 𝑎 = 𝑎
𝑎 ≠ 0 → 𝑎 * 𝑎⁻¹ = 1
𝑎 ≠ 0 → 𝑎⁻¹ * 𝑎 = 1
𝑎 * (𝑏 + 𝑐) = 𝑎 * 𝑏 + 𝑎 * 𝑐
𝑦 + 𝑥 = 𝑥 + 𝑦
𝑥 + (𝑦 + 𝑧) = (𝑥 + 𝑦) + 𝑧
𝑥 + 0 = 𝑥
1 • 𝑥 = 𝑥
(𝑎 + 𝑏) • 𝑥 = 𝑎 • 𝑥 + 𝑏 • 𝑥
(𝑎 * 𝑏) • 𝑥 = 𝑎 • (𝑏 • 𝑥)
𝑎 • (𝑥 + 𝑦) = 𝑎 • 𝑥 + 𝑎 • 𝑦
⟨𝑢 + 𝑣, 𝑥⟩ = ⟨𝑢, 𝑥⟩ + ⟨𝑣, 𝑥⟩
⟨𝑢, 𝑥 + 𝑦⟩ = ⟨𝑢, 𝑥⟩ + ⟨𝑢, 𝑦⟩
𝑎 * ⟨𝑢, 𝑥⟩ = ⟨𝑎 • 𝑢, 𝑥⟩
𝑎 * ⟨𝑢, 𝑥⟩ = ⟨𝑢, 𝑎 • 𝑥⟩
𝐴 (𝑥 + 𝑦) = 𝐴 𝑥 + 𝐴 𝑦
𝑎 • 𝐴 𝑥 = 𝐴 (𝑎 • 𝑥)
⟨𝐴†𝑢, 𝑥⟩ = ⟨𝑢, 𝐴 𝑥⟩

A central part of automated theorem proving (ATP) is constructing equality proofs such as (4.1) from (4.2) automatically. This can be done with well-researched techniques from the field of term rewriting systems [BN98]. These techniques take advantage of the fact that computers can perform many operations per second, and large search spaces can be explored quickly, though heuristic functions are still needed to prevent a combinatorial explosion. Many domains - such as checking that two expressions are equal using the ring axioms - also have specialised decision procedures available for them. I'll call these approaches to solving equalities machine-oriented; this contrasts with human-oriented as discussed in Section 2.6.

In accordance with the research goals of this thesis (Section 1.2), the purpose here is to investigate alternative, human-like ways of producing equality proofs. As motivated in Section 1.1, this serves the purpose of improving the usability of proof assistants by making the proofs generated more understandable (Section 2.5). The goal of this chapter is not to develop methods that compete with machine-oriented techniques to prove more theorems or prove them faster. Instead, I want to focus on the abstract reasoning that a human mathematician typically carries out when they encounter an equality reasoning problem such as (4.1).

With this in mind, the goal of this chapter is to create an algorithm which:

Typically, existing ATP methods do not scale well with the number of competing rules introduced, as one would expect of algorithms that make use of significant amounts of brute-force search. If we can devise new architectures that solve simple equalities with less search, then it may be possible to scale up these techniques to larger problems and improve the efficiency of established ATP methods.

This chapter presents the subtask algorithm which has some success with respect to the above goals. The algorithm is written in Lean 3 [MKA+15] and can be found on GitHub. The work in this chapter also appears as a paper published in KI 2019 [AGJ19]. In the remainder of the chapter I give a motivating example (Section 4.2) followed by a description of the algorithm (Section 4.3). The algorithm is then contrasted with existing approaches (Section 4.4) and evaluated against the above goals (Section 4.5).

4.2. Example

Let us begin with a motivating example (4.1) in elementary linear algebra. We have to solve the goal of the Box (4.3) using the rewrite rules given in (4.2).


The Box representing the task to solve in this instance. Full detail on Box is given in Chapter 3. For the purposes of this chapter, a Box represents a theorem to prove with a list of variables and hypotheses above the line and a goal proposition to prove below the line.

V : VectorSpace
𝑥 𝑢 𝑣 : V
𝐴 : V → V

⟨𝐴†(𝑢 + 𝑣), 𝑥⟩ = ⟨𝐴†𝑢 + 𝐴†𝑣, 𝑥⟩

To do this, a human's proving process might proceed as follows:

List 4.4

A sketch of a human's possible thought process when constructing an equality proof for (4.3).

  1. I need to create the expression ⟨𝐴†𝑢 + 𝐴†𝑣, 𝑥⟩.
  2. In particular, I need to make the subexpressions 𝐴†𝑢 and 𝐴†𝑣. Let's focus on 𝐴†𝑢.
  3. The only sensible way I can get this is to use the definition ⟨𝑢, 𝐴 ?𝑧⟩ = ⟨𝐴†𝑢, ?𝑧⟩, presumably with ?𝑧 = 𝑥.
  4. In particular, I'll need to make the subterm 𝐴 ?𝑧 for some ?𝑧.
  5. I can do that straight away: ⟨𝐴†(𝑢 + 𝑣), 𝑥⟩ = ⟨𝑢 + 𝑣, 𝐴 𝑥⟩ using the rewrite rule ∀ 𝑤 𝑧, ⟨𝐴†𝑤, 𝑧⟩ = ⟨𝑤, 𝐴 𝑧⟩.
  6. Now I'm in a position to obtain the subexpression ⟨𝑢, 𝐴 𝑥⟩ I wanted in step 3, so let me do that using bilinearity: ⟨𝑢 + 𝑣, 𝐴 𝑥⟩ = ⟨𝑢, 𝐴 𝑥⟩ + ⟨𝑣, 𝐴 𝑥⟩.
  7. And now I can get the subexpression 𝐴†𝑢 I wanted even earlier in step 2, so let me do that: ⟨𝑢, 𝐴 𝑥⟩ + ⟨𝑣, 𝐴 𝑥⟩ = ⟨𝐴†𝑢, 𝑥⟩ + ⟨𝑣, 𝐴 𝑥⟩.
  8. In step 2 I also wanted to create 𝐴†𝑣, which I can now get too: ⟨𝐴†𝑢, 𝑥⟩ + ⟨𝑣, 𝐴 𝑥⟩ = ⟨𝐴†𝑢, 𝑥⟩ + ⟨𝐴†𝑣, 𝑥⟩.
  9. And with one application of bilinearity I'm home: ⟨𝐴†𝑢, 𝑥⟩ + ⟨𝐴†𝑣, 𝑥⟩ = ⟨𝐴†𝑢 + 𝐴†𝑣, 𝑥⟩.

The key aspect of the thought process in List 4.4 is the setting of intermediate aims, such as obtaining certain subexpressions when one does not immediately see how to obtain the entire expression. Let's do this by creating a tree of subtasks (Figure 4.5).

Figure 4.5

The subtask tree for solving (4.3): ⟨𝐴†(𝑢 + 𝑣), 𝑥⟩ = ⟨𝐴†𝑢 + 𝐴†𝑣, 𝑥⟩. Circled numbers correspond to steps in List 4.4, so the 'focus' of the algorithm travels around the tree as it progresses. Details on how this tree is generated will follow in Section 4.3.

The tree in Figure 4.5 represents what the algorithm does with the additivity-of-adjoint problem (4.3). It starts with the subtask create_all ⟨𝐴†𝑢 + 𝐴†𝑣, 𝑥⟩ at ①. Since it cannot achieve that in one application of an available rule, it creates a set of subtasks and then chooses the one that is most promising: later in Section 4.3, I will explain how it generates and evaluates possible subtasks. In this case the most promising subtask is create 𝐴†𝑢, so it selects that in ② and identifies a rewrite rule - the definition of adjoint: ∀ 𝑤 𝑧, ⟨𝐴†𝑤, 𝑧⟩ = ⟨𝑤, 𝐴 𝑧⟩ - that can achieve it, adding use ⟨𝑢, 𝐴 ?𝑧⟩ = ⟨𝐴†𝑢, ?𝑧⟩ to the tree at ③. The ?𝑧 that appears at ③ in Figure 4.5 is a metavariable (that is, a placeholder for an expression to be chosen later; see Section 2.4 for background information on metavariables) that will in due course be assigned to 𝑥. Now the process repeats on ③: a set of subtasks is again created for the lhs of ⟨𝑢, 𝐴 ?𝑧⟩ = ⟨𝐴†𝑢, ?𝑧⟩ and the subtask create 𝐴 ?𝑧 is selected (④). Now, there does exist a rewrite rule that will achieve create 𝐴 ?𝑧: ⟨𝐴†(𝑢 + 𝑣), 𝑥⟩ = ⟨𝑢 + 𝑣, 𝐴 𝑥⟩, so this is applied and now the algorithm iterates back up the subtask tree, testing whether the new expression ⟨𝑢 + 𝑣, 𝐴 𝑥⟩ achieves any of the subtasks and whether any new rewrite rules can be used to achieve them.

In the next section, I will provide the design of an algorithm that behaves according to these motivating principles.

4.3. Design of the algorithm

The subtasks algorithm may be constructed as a search over a directed graph S.

The subtask algorithm's state 𝑠 : S has three components ⟨𝑡, 𝑓, 𝑐⟩:

Given an equational reasoning problem Γ ⊢ 𝑙 = 𝑟, the initial state 𝑠₀ : S consists of a tree with a single root node CreateAll 𝑟 and a current expression 𝑙. We reach a goal state when the current expression 𝑐 is definitionally equal to 𝑟 (that is, the two expressions are equal by reflexivity).

The first thing to note is that if we project 𝑠 : S to the current expression 𝑐, then we can recover the original equational rewriting problem E by taking the edges to be all possible rewrites between terms. One problem with searching this space is that the number of edges leaving an expression can be infinite. (For example, the rewrite rule ∀ 𝑎 𝑏, 𝑎 = 𝑎 - 𝑏 + 𝑏 can be applied to any expression with any expression being substituted for 𝑏.) The typical way that this problem is avoided is to first ground all available rewrite rules by replacing all free variables with relevant expressions. The subtasks algorithm does not do this, because this is not a step that humans perform when solving simple equality problems. Even after grounding, the combinatorial explosion of possible expressions makes E a difficult graph to search without good heuristics. The subtasks algorithm makes use of the subtasks tree 𝑡 to guide the search in E in a manner that is intended to match the process outlined in List 4.4 and Figure 4.5.

A task 𝑡 : Task implements the following three methods:

This design enables the system to be modular, where different sets of tasks and strategies can be included. Specific examples of tasks and strategies used by the algorithm are given in Section 4.3.1. Given a state 𝑠 : S, the edges leading from 𝑠 are generated using the flowchart shown in Figure 4.6.

Let 𝑓 be the focussed subtask for 𝑠. In the case that test(𝑓) is true the algorithm 'ascends' the task tree. In this branch, 𝑓 is tagged as 'achieved' and the new focussed task is set as parent of 𝑓. Then, it is checked whether any siblings of 𝑓 that were marked as achieved are no longer achieved (that is, there is a sibling task 𝑡 tagged as achieved but test(𝑡) is now false). The intuition behind this check on previously achieved subtasks is that once a sibling task is achieved, it should not be undone by a later step because the assumption is that all of the child subtasks are needed before the parent task can be achieved.

In the case that test(𝑓) is false, meanwhile, the algorithm 'explores' the task tree by finding new child subtasks for 𝑓. To do this, refine(𝑓) is called to produce a set of candidate subtasks for 𝑓. For each 𝑡 ∈ refine(𝑓), 𝑡 is inserted as a child of 𝑓 provided that test(𝑡) is false and 𝑡 does not appear as an ancestor of 𝑓. Duplicate children are removed. Finally, for each subtask 𝑡, a new state is yielded with the focus now set to 𝑡. Hence 𝑠's outdegree in the graph will be the number of children that 𝑓 has after refining. (The outdegree of a vertex 𝑣 in a directed graph is the number of edges leaving 𝑣.)

Figure 4.6

Flowchart for generating edges for a starting state 𝑠 : S. Here, each call to yield state will produce another edge leading from 𝑠 to the new state.
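The 'ascend' and 'explore' branches described above can be sketched in Python as follows. This is a simplified illustration rather than the Lean implementation: the tree is mutated in place instead of being copied per state, the current expression is a string, the execution of strategies is omitted, and the names `Node`, `create` and `successors` are hypothetical. The text does not say what happens when a previously achieved sibling is undone, so abandoning the branch in that case is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass
class Node:
    name: str
    test: Callable[[str], bool]            # is the task achieved by the current expression?
    refine: Callable[[str], List["Node"]]  # candidate child subtasks
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    achieved: bool = False


def ancestors(node: Node) -> Iterator[Node]:
    while node.parent is not None:
        node = node.parent
        yield node


def successors(focus: Node, current: str) -> List[Node]:
    """Return the nodes that may become the focus of a next state."""
    if focus.test(current):
        # Ascend: tag the focus as achieved and move to its parent.
        focus.achieved = True
        parent = focus.parent
        if parent is None:
            return []
        siblings = [s for s in parent.children if s is not focus]
        if any(s.achieved and not s.test(current) for s in siblings):
            return []  # assumption: abandon branches that undo an achieved sibling
        return [parent]
    # Explore: add unachieved, non-ancestor, non-duplicate candidates as children.
    seen = {c.name for c in focus.children}
    banned = {a.name for a in ancestors(focus)}
    for t in focus.refine(current):
        if t.test(current) or t.name in banned or t.name in seen:
            continue
        t.parent = focus
        focus.children.append(t)
        seen.add(t.name)
    return list(focus.children)


def create(sub: str) -> Node:
    # A toy 'create' task: achieved when `sub` occurs textually in the expression.
    return Node(name=f"create {sub}", test=lambda c: sub in c, refine=lambda c: [])


target = "<adj A u + adj A v, x>"
root = Node(
    name=f"create_all {target}",
    test=lambda c: c == target,
    refine=lambda c: [create("adj A u"), create("adj A v"), create("adj A u")],
)
print([n.name for n in successors(root, "<adj A (u + v), x>")])
```

Each returned node corresponds to one edge leaving the state, so the number of returned nodes is the outdegree discussed above.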

Now that the state space S, the initial state 𝑠₀, the goal states and the edges on S are defined, we can perform a search on this graph with the help of a heuristic function h : S → [0, ∞] to be discussed in Section 4.3.2. The subtasks algorithm uses greedy best-first search with backtracking points. However, other graph-search algorithms such as A⋆ or hill-climbing may be used.
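As a generic illustration of greedy best-first search with backtracking, the Python sketch below keeps every unexplored successor in a priority queue, so that the search can return to an earlier alternative when the current branch stalls. It assumes that a lower value of the heuristic means a more promising state, which is a convention chosen for this sketch and not something stated in the text; the function name and the toy integer problem are likewise hypothetical.

```python
import heapq
from itertools import count
from typing import Callable, Hashable, Iterable, List, Optional


def greedy_best_first(
    start: Hashable,
    is_goal: Callable[[Hashable], bool],
    successors: Callable[[Hashable], Iterable[Hashable]],
    h: Callable[[Hashable], float],
) -> Optional[List[Hashable]]:
    """Always expand the frontier state with the smallest heuristic value."""
    tie = count()  # tie-breaker so that states themselves are never compared
    frontier = [(h(start), next(tie), start, [start])]
    visited = {start}
    while frontier:
        _, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        for nxt in successors(state):
            if nxt not in visited:
                visited.add(nxt)
                # Unexpanded entries left in the frontier act as backtracking points.
                heapq.heappush(frontier, (h(nxt), next(tie), nxt, path + [nxt]))
    return None


# Toy problem: reach 10 from 1 using the moves n -> n + 1 and n -> 2 * n.
path = greedy_best_first(
    start=1,
    is_goal=lambda n: n == 10,
    successors=lambda n: [n + 1, 2 * n] if n < 100 else [],
    h=lambda n: abs(10 - n),
)
print(path)
```

Swapping the priority from h alone to path cost plus h would turn the same loop into A⋆.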

4.3.1. The defined subtasks

In this section I will provide a list of the subtasks that are implemented in the system and some justification for their design.

create_all 𝑒

The create_all : Expr → Task task is the root task of the subtask tree.

The motivation behind the refinement rule is that since 𝑏 appears in 𝑒 but not in the current expression, it must necessarily arise as a result of applying a rewrite rule. Rather than include every subterm of 𝑒 with this property, we need only include the minimal subterms with this property, since if 𝑏 is a subterm of 𝑏′, then test(create 𝑏′) implies test(create 𝑏). In the running example (4.3), the subtasks of create_all ⟨𝐴†𝑢 + 𝐴†𝑣, 𝑥⟩ are create (𝐴†𝑢) and create (𝐴†𝑣).

create 𝑒

The create task is achieved if the current expression contains 𝑒.
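The containment test for create, and the minimal-subterm refinement described for create_all, can be sketched in Python. Expressions are modelled here as nested tuples of the form (head, arg₁, ..., argₙ) with strings as atoms; this encoding, the head name `adjA`, and the function names are assumptions made for illustration only.

```python
from typing import Iterator, List, Tuple, Union

Expr = Union[str, Tuple]  # an atom, or (head, arg1, ..., argn)


def subterms(e: Expr) -> Iterator[Expr]:
    """Yield e followed by all of its subterms (heads are not subterms)."""
    yield e
    if isinstance(e, tuple):
        for arg in e[1:]:
            yield from subterms(arg)


def contains(current: Expr, e: Expr) -> bool:
    """The test for `create e`: does the current expression contain e?"""
    return any(s == e for s in subterms(current))


def minimal_missing(target: Expr, current: Expr) -> List[Expr]:
    """Subterms of target absent from current whose proper subterms are all present."""
    present = set(subterms(current))
    result: List[Expr] = []
    for s in subterms(target):
        if s in present or s in result:
            continue
        proper = list(subterms(s))[1:]  # drop s itself
        if all(p in present for p in proper):
            result.append(s)
    return result


# <adjA(u + v), x>  versus  <adjA u + adjA v, x>
current = ("inner", ("adjA", ("add", "u", "v")), "x")
target = ("inner", ("add", ("adjA", "u"), ("adjA", "v")), "x")
print(minimal_missing(target, current))
```

On the running example this yields the two subterms corresponding to create (𝐴†𝑢) and create (𝐴†𝑣); the larger sum is skipped because it has a missing proper subterm.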

Given a rewrite rule 𝑟 : ∀ (..𝑥𝑠), 𝑎 = 𝑏, say that an expression 𝑒 overlaps with the right hand side of 𝑟 when there exists a most-general substitution σ on 𝑟's variables 𝑥𝑠 such that

Additionally, as mentioned, create 𝑒 can sometimes refine to yield a reduce_distance subtask. The condition for this depends on the distance between two subterms in a parent expression 𝑐 : Expr, which is defined as the number of edges between the roots of the subterms -- viewing 𝑐's expression tree as a graph. If two local variables 𝑥, 𝑦 are present exactly once in both the current expression and 𝑒, and the distance between them is greater in the current expression, then reduce_distance 𝑥 𝑦 is included as a subtask.
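The distance measure can be made concrete with a short Python sketch. As an assumption for illustration, expressions are nested tuples (head, arg₁, ..., argₙ) with string atoms, and each variable is taken to occur exactly once, as the condition above requires. The distance is the number of tree edges on the path joining the two occurrences.

```python
from typing import Optional, Tuple, Union

Expr = Union[str, Tuple]  # an atom, or (head, arg1, ..., argn)


def path_to(e: Expr, var: str, prefix: Tuple[int, ...] = ()) -> Optional[Tuple[int, ...]]:
    """Return the sequence of argument indices leading from the root of e to var."""
    if e == var:
        return prefix
    if isinstance(e, tuple):
        for i, arg in enumerate(e[1:]):
            found = path_to(arg, var, prefix + (i,))
            if found is not None:
                return found
    return None


def distance(e: Expr, x: str, y: str) -> Optional[int]:
    """Number of edges between the occurrences of x and y in e's expression tree."""
    px, py = path_to(e, x), path_to(e, y)
    if px is None or py is None:
        return None
    common = 0
    for a, b in zip(px, py):
        if a != b:
            break
        common += 1
    # Edges from each occurrence up to their lowest common ancestor.
    return (len(px) - common) + (len(py) - common)


current = ("add", ("mul", "x", "a"), ("mul", "y", "b"))  # x * a + y * b
target = ("mul", ("add", "x", "y"), "c")                 # (x + y) * c
print(distance(current, "x", "y"), distance(target, "x", "y"))
```

Here 𝑥 and 𝑦 are further apart in the current expression than in the target, which is the situation in which reduce_distance 𝑥 𝑦 would be included as a subtask.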

In order to handle cases where multiple copies of 𝑒 are required, create has an optional count argument that may be used to request an nth copy of 𝑒.

use (𝑎 = 𝑏)

This is the simplest strategy. It simply represents the subtask of using the given rewrite rule.

reduce_distance (𝑥, 𝑦)

reduce_distance is an example of a greedy, brute-force strategy. It will perform any rewrite rule that moves the given variables closer together and then terminate.

4.3.2. Heuristics

In this section I present the heuristic function developed for the subtasks algorithm. The ideas behind this function are derived from introspection on equational reasoning and some degree of trial and error on a set of equality problems.

There are two heuristic functions that are used within the system, an individual strategy heuristic and an 'overall-score' heuristic that evaluates sets of child strategies for a particular task. Overall-score is used on tasks which are not strategies by performing a lookahead of the child strategies of the task. The child strategies 𝑆₁, 𝑆₂ are then scored individually through a scoring system, scoring higher if they:

The intuition behind all of these is to give higher scores to strategies that are salient in some way, either by containing subterms that are present in the current expression or because other subtasks are achieved.

From these individual scores, the score for the parent task of 𝑆₁, 𝑆₂ ... is computed as follows: If there is only one strategy then it scores 10. If there are multiple strategies, it discards any scoring less than -5. If there are positive-scoring strategies then all negative-scoring strategies are discarded. The overall score is then set to be 5 minus the number of strategies in the list. The intention of this procedure is that smaller sets of strategies should be preferred, even if their scores are bad because it limits choice in what to do next.
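The overall-score computation described above is short enough to write out directly. The following Python sketch follows the prose literally; the function name is hypothetical, and the treatment of a score of exactly zero (kept alongside positive scores) and of an empty list are assumptions, since the text does not specify them.

```python
from typing import List


def overall_score(strategy_scores: List[float]) -> float:
    """Score a parent task from the individual scores of its child strategies."""
    if len(strategy_scores) == 1:
        return 10
    # With multiple strategies, discard any scoring less than -5.
    kept = [s for s in strategy_scores if s >= -5]
    # If any strategy scores positively, discard all negative-scoring ones.
    if any(s > 0 for s in kept):
        kept = [s for s in kept if s >= 0]
    # Fewer remaining strategies means a higher overall score.
    return 5 - len(kept)


print(overall_score([3]))          # a single strategy
print(overall_score([4, -2, -7]))  # only the positive strategy survives
print(overall_score([-1, -2, -9])) # no positive strategy, one discarded
```

Note that the individual scores only decide which strategies survive the filtering; the final value depends on how many remain, which is what makes smaller sets of strategies preferred.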

The underlying idea behind the overall-scoring heuristic is that often the first sensible strategy found is enough of a signpost to solve simple problems. That is, once one has found one plausible strategy of solving a simple problem it is often fruitful to stop looking for other strategies which achieve the same thing and to get on with finding a way of performing the new strategy.

4.3.3. Properties of the algorithm

The subtasks algorithm is sound provided sound rewrite rules are produced by the function execute : Task → Option Rewrite. That is, take an equation to solve Γ ⊢ 𝑙 = 𝑟 and a path 𝑠₀, 𝑠₁, ..., 𝑠ₙ in S, where 𝑠₀ is the initial state defined in Section 4.3 and 𝑠ₙ is a goal state. By forgetting the subtask tree, a solution path in S can be projected to a solution path in E, the equational rewriting graph. This projected path is exactly a proof of 𝑙 = 𝑟; it will be composed of a sequence 𝑙 ≡ 𝑐₀ = 𝑐₁ = ... = 𝑐ₙ ≡ 𝑟 where 𝑐ᵢ is the current expression of 𝑠ᵢ. Each equality in the chain holds either because the proofs returned from execute are assumed to be sound or because the current expression does not change between steps.

The next question to ask is whether S is complete with respect to E. That is, does S contain a path to the solution whenever E contains one? The answer to this depends on the output of refine. If refine always returns an empty list of subtasks then S is not complete, because no subtasks will ever be executed. The set of subtasks provided in Section 4.3.1 is not complete. For example, the problem 1 - 1 = 𝑥 + - 𝑥 will not solve without additional subtasks, since the smallest non-present subterm is 𝑥, so create 𝑥 is added, which then does not refine further using the procedure in Section 4.3.1. In Section 4.6 I will discuss some methods to address this.

4.4. Qualitative comparison with related work

There has been a substantial amount of research on the automation of solving equality chain problems over the last decade. The approach of the subtasks algorithm is to combine these rewriting techniques with a hierarchical search. In this section I compare the subtasks algorithm with this related work.

4.4.1. Term Rewriting

One way to find equality proofs is to perform a graph search using a heuristic. This is the approach of the rewrite-search algorithm by Hoek and Morrison [HM19] (https://github.com/semorrison/lean-rewrite-search), which uses the heuristic of string edit-distance between the pretty-printed strings of the two expressions. The rewrite-search algorithm does capture some human-like properties in the heuristic, since the pretty printed expressions are intended for human consumption. The subtasks algorithm is different from rewrite-search in that the search is guided according to achieving sequences of tasks. Since both subtasks and rewrite-search are written in Lean, some future work could be to investigate a combination of both systems.

A term rewriting system (TRS) 𝑅 is a set of oriented rewrite rules. There are many techniques available for turning a set of rewrite rules into procedures that check whether two terms are equal. One technique is completion, where 𝑅 is converted into an equivalent TRS 𝑅' that is convergent. This means that any two expressions 𝑎, 𝑏 are equal under 𝑅 if and only if repeated application of rules in 𝑅' to 𝑎 and 𝑏 will produce the same expression. Finding equivalent convergent systems, if not done by hand, is usually done by finding decreasing orderings on terms and using Knuth-Bendix completion [KB70]. When such a system exists, automated rewriting systems can use these techniques to quickly find proofs, but the proofs are often overly long and needlessly expand terms.
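The way a convergent system decides equality can be sketched in Python. The example below does not perform Knuth-Bendix completion; it assumes a small hand-chosen convergent system (the two usual rules for addition on unary numerals) and an assumed tuple encoding of terms, and simply compares normal forms.

```python
from typing import Tuple, Union

# A term is "0", ("s", t) for successor, or ("add", a, b).
Term = Union[str, Tuple["Term", ...]]


def normalise(t: Term) -> Term:
    """Rewrite t to normal form, innermost first, using the rules
    add(0, y) -> y  and  add(s(x), y) -> s(add(x, y))."""
    if isinstance(t, str):
        return t
    head, *args = t
    args = [normalise(a) for a in args]
    if head == "add":
        a, b = args
        if a == "0":
            return b
        if isinstance(a, tuple) and a[0] == "s":
            return ("s", normalise(("add", a[1], b)))
    return (head, *args)


def equal_under_rules(a: Term, b: Term) -> bool:
    """Two terms are equal under the rules iff their normal forms coincide."""
    return normalise(a) == normalise(b)


one = ("s", "0")
two = ("s", one)
print(equal_under_rules(("add", one, one), ("add", two, "0")))
```

The check is fast, but the trace of rewrites it performs fully unfolds both sides, which is the kind of needlessly expanded proof that the paragraph above describes.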

Another method is rewrite tables, where a lookup table of representatives for terms is stored in a way that allows for two terms to be matched through a series of lookups.

Both completion and rewrite tables can be considered machine-oriented because they rely on large data structures and systematic applications of rewrite rules. Such methods are certainly highly useful, but they can hardly be said to capture the process by which humans reason.

Finally, there are many normalisation and decision procedures for particular domains, for example on rings [GM05]. Domain-specific procedures do not satisfy the criterion of generality given in Section 4.1.

4.4.2. Proof Planning

Background information on proof planning is covered in Section 2.6.2.

The subtasks algorithm employs a structure that is similar to a hierarchical task network (HTN) [Sac74, Tat77, MS99]. The general idea of a hierarchical task network is to break a given abstract task (e.g., "exit the room") into a sequence of subtasks ("find a door" then "go to door" then "walk through the door") which may themselves be recursively divided into subtasks ("walk through the door" may have a subtask of "open the door" which may in turn have "grasp doorhandle" until bottoming out with a ground actuation such as "move index finger 10°"). This approach has found use for example in the ICARUS robotics architecture [CL18, LCT08]. HTNs have also found use in proof planning [MS99].
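The recursive decomposition in the room-exiting example can be sketched in Python as follows. The `METHODS` table is hypothetical: the tasks "step forward" and "turn doorhandle" are invented here to complete the example, and real HTN planners also handle preconditions and alternative methods, which this sketch omits.

```python
from typing import Dict, List

# Hypothetical decomposition methods: each abstract task maps to a sequence
# of subtasks. A task with no entry is treated as a ground action.
METHODS: Dict[str, List[str]] = {
    "exit the room": ["find a door", "go to door", "walk through the door"],
    "walk through the door": ["open the door", "step forward"],
    "open the door": ["grasp doorhandle", "turn doorhandle"],
    "grasp doorhandle": ["move index finger 10°"],
}


def decompose(task: str) -> List[str]:
    """Expand a task depth-first into a flat sequence of ground actions."""
    if task not in METHODS:
        return [task]
    plan: List[str] = []
    for subtask in METHODS[task]:
        plan.extend(decompose(subtask))
    return plan


plan = decompose("exit the room")
for action in plan:
    print(action)
```

This sketch expands the whole hierarchy before acting, which is the behaviour that the subtasks algorithm deliberately avoids.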

The main difference between the approach used in the subtasks algorithm and that of proof planning and hierarchical task networks is that the subtasks algorithm is greedier: it generates enough of a plan to have little doubt what the first rewrite rule in the sequence should be, and no more. I believe that this reflects how humans reason when solving simple problems: favouring just enough planning to decide on a good first step, and then planning further only once the step is completed and new information is revealed.

A difference between HTNs and subtasks is that the chains of subtasks do not necessarily reach a ground subtask (for subtasks this is a rewrite step that can be performed immediately). This means that the subtasks algorithm needs to use heuristics to determine whether it is appropriate to explore a subtask tree or not, instead of relying on the task hierarchy eventually terminating with ground tasks. The subtasks algorithm also inherits all of the problems found in hierarchical planning, the main one being finding heuristics for determining whether a subtask should be abandoned or refined further. The heuristics given in Section 4.3.2 help with this, but there are plenty more ideas from the hierarchical task planning literature that could also be incorporated. Of particular interest for me are the applications of hierarchical techniques from the field of reinforcement learning. A good introductory text to modern reinforcement learning is Reinforcement Learning: An Introduction by Sutton and Barto [SB18b]. Readers wishing to learn more about hierarchical reinforcement learning may find the survey article by Flet-Berliac [Fle19] to be a good jumping-off point.

4.5. Evaluation

The ultimate motivation behind the subtasks algorithm is to make an algorithm that behaves as a human mathematician would. I do not wish to claim that I have fully achieved this, but we can evaluate the algorithm with respect to the general goals that were given in Chapter 1.

The method of evaluation is to use the algorithm implemented as a tactic in Lean on a library of thirty or so example problems. This is not large enough for a substantial quantitative comparison with existing methods, but we can still investigate some properties of the algorithm. The source code also contains many examples which are outside the abilities of the current implementation of the algorithm. Some ways to address these issues are discussed in Section 4.6.

Table 4.7 gives some selected examples. These are all problems that the algorithm can solve with no backtracking.

Table 4.7

Performance of the subtasks algorithm on some example problems. Steps gives the number of rewrite steps in the final proof. Location gives the file and declaration name of the example in the source code.

ℎ : α
𝑙 𝑠 : List α
rev(𝑙 ++ 𝑠) = rev(𝑠) ++ rev(𝑙)
∀ 𝑎, rev(𝑎 :: 𝑙) = rev(𝑙) ++ [𝑎]
⊢ rev(ℎ :: 𝑙 ++ 𝑠) = rev(𝑠) ++ rev(ℎ :: 𝑙)
A : Monoid
𝑎 : A
𝑚 𝑛 :