Chapter 7
Conclusion

In this thesis, I have presented the design and evaluation for HumanProof and ProofWidgets. In this chapter I will conclude the thesis by reflecting on the research questions given in Chapter 1 and talking about further work.

# 7.1. Revisiting the research questions

Let's review the research questions that I set out to answer in Section 1.2 and outline the contributions that I have made towards them.

## 7.1.1. What constitutes a human-like, understandable proof?

Identify what 'human-like' and 'understandable' mean to different people.

In Section 2.5, I investigated the work of other mathematicians and mathematics educators on the question of what it means to understand a proof of a theorem. This review took me from educational research to the works of Spinoza, but yielded little in the way of explicit answers to this question. In Chapter 6, I asked some students of mathematics at the University of Cambridge what features of a proof made it understandable to them. The participants remarked that a proof being understandable is a function of numerous factors: providing the intuition or motivation of a proof first, signposting the purpose of various sections of a proof and providing the right level of detail. One thing that was frequently stressed, however, was that syntax and notation of proofs only played a minor role in how understandable a proof is; while unfamiliar syntax only hinders understanding temporarily and may be overcome by becoming familiar with the notation.

Distinguish between human-like and machine-like in the context of ITP. A similar review was undertaken in Section 2.6 for what constitutes 'human-like' reasoning. The topic has received attention from early efforts to create proof assistants and automated theorem provers up to the present day. My conclusion from this review is that 'human-like' is best understood as referring to a general approach to ATP algorithm design, in contrast to 'machine-like'. Human-like proving techniques emphasise reasoning methods that are compatible with how humans reason, in the sense that a proof is intelligible for a human. Pre-80s, for example, Robinson's resolution theorem proving was the dominant architecture of provers [BG01[BG01]Bachmair, Leo; Ganzinger, HaraldResolution theorem proving (2001)Handbook of automated reasoninghttps://www.sciencedirect.com/book/9780444508133/handbook-of-automated-reasoning]. However as noted by Bledsoe [Ble81[Ble81]Bledsoe, Woodrow WNon-resolution theorem proving (1981)Readings in Artificial Intelligencehttps://doi.org/10.1016/0004-3702(77)90012-1], repeated application of the resolution rule (A ∨ B) ∧ (¬A ∨ C) ⊢ B ∨ C can hardly be called 'human-like', and such a proof would not be found in a mathematical textbook. Since human-like is defined more in terms of what it is not, there are a wide variety of approaches which may all be described as human-like: proof planning, diagrammatic reasoning, and graphical discourse models. I chose to focus on human-like logical reasoning and modelling how an undergraduate may approach writing a proof.

Merge these strands to create and determine a working definition of human-like. In Section 3.1, I decided that I would deem the design of the system as human-like if it was similar enough to the reasoning of humans and could produce natural language write-ups that were convincing enough for mathematicians. I also restricted myself to only look at elementary 'follow your nose' proofs. That is, simple proofs where at each step of the proof the number of sensible steps (according to a human) is highly restricted.

## 7.1.2. How can human-like reasoning be represented within an interactive theorem prover to produce formalised, understandable proofs?

Form a calculus of representing goal states and inference steps that acts at the abstraction layer that a human uses when solving proofs. In Chapter 3 I detail a development calculus created for the purpose of representing human-like goal states in accordance with the working definition given in Section 7.1.2. The calculus (defined in Section 3.3.2) makes use of a hierarchical proof state structure that also incrementally constructs a formal proof term in dependent type theory. The calculus is compared with other designs (Section 3.3.5), of which the closest is the design to Gowers and Ganesalingam's Robot prover [GG17[GG17]Ganesalingam, Mohan; Gowers, W. T.A fully automatic theorem prover with human-style output (2017)Journal of Automated Reasoninghttps://doi.org/10.1007/s10817-016-9377-1] and McBride's OLEG [McB00[McB00]McBride, ConorDependently typed functional programs and their proofs (2000)PhD thesis (University of Edinburgh)http://hdl.handle.net/1842/374]. I then provided a set of tactics for manipulating these proof states and provide proofs that these tactics are sound (Section 3.4 and Appendix A).

In Chapter 4, I also introduce a new algorithm for creating human-like equational reasoning proofs. This algorithm makes use of a hierarchical set of 'subtasks' to find equality proofs. This system can solve simple equational reasoning tasks in a novel way. It works well as a prototype but the subtasks system will need more improvement before it can be used in a practical setting.

Create a system for also producing natural language proofs from this calculus. The component that performs this is detailed in Section 3.6. The system for verbalising HumanProof-constructed proofs to natural language made use of a classical NLG pipeline. This component need only be good enough to demonstrate that the calculus of HumanProof can create human-like proofs, and so I did not focus on innovating beyond existing work in this field, notably Gowers & Ganesalingam [GG17[GG17]Ganesalingam, Mohan; Gowers, W. T.A fully automatic theorem prover with human-style output (2017)Journal of Automated Reasoninghttps://doi.org/10.1007/s10817-016-9377-1] and earlier systems such as PROVERB [HF97[HF97]Huang, Xiaorong; Fiedler, ArminProof Verbalization as an Application of NLG (1997)International Joint Conference on Artificial Intelligencehttp://ijcai.org/Proceedings/97-2/Papers/025.pdf]. However I did contribute a system for verbalising sequences of binders in dependent type theory in Section 3.6.4.

Evaluate the resulting system by performing a study on real mathematicians. This was discussed in Chapter 6. The study found that mathematicians generally prefer HumanProof proofs to Lean proofs with the exception of equality proofs, where the additional information required to specify each step in an equality chain was preferred. A surprising result of the study was that non-specialist mathematicians do not trust proofs backed with a formal guarantee more than natural language proofs. This result suggests that - considering a proof assistant as a tool for working mathematicians - formalisation can't be a substitute for an understanding of the material.

I made some significant progress towards this research goal, however the solution that I have implemented can be found to stumble upon given harder examples, both in terms of automation and in the write-ups getting progressively clunkier upon growing in size. The implementation as it stands also does not extend to more difficult domains where some detail must be hidden in the name of brevity. I will outline some specific solutions to these issues in Section 7.2. In the end, I chose to focus less on extending the automation of tactics beyond what was available in Robot and instead focus on subtasks described in Chapter 4 and interactive theorem proving through a graphical user interface.

I believe that these defects could be fixed with more research, however one has to ask whether such a human-research-intensive approach is going to be a good long-term solution. This question becomes particularly salient when faced with the advent of large-scale deep learning language models:

Very recently, we are starting to see applications of attention based models [VSP+17[VSP+17]Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; et al.Attention is All you Need (2017)Neural Information Processing Systemshttps://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html] (also known as transformers) to the problem of predicting human-written proofs of mathematics with promising results: Li et al direct transformer models towards predicting steps in Isabelle/Isar[Wen99], see Section 2.6.1[Wen99]Wenzel, MarkusIsar - A Generic Interpretative Approach to Readable Formal Proof Documents (1999)Theorem Proving in Higher Order Logicshttps://doi.org/10.1007/3-540-48256-3_12 proofs [LYWP21[LYWP21]Li, Wenda; Yu, Lei; Wu, Yuhuai; et al.IsarStep: a Benchmark for High-level Mathematical Reasoning (2021)International Conference on Learning Representationshttps://openreview.net/forum?id=Pzj6fzU6wkj]. See also Lample and Charton's work on applying transformers to algebraic reasoning and integration [LC20[LC20]Lample, Guillaume; Charton, FrançoisDeep Learning For Symbolic Mathematics (2020)ICLRhttps://openreview.net/forum?id=S1eZYeHFDS]. Some work in this space that I have been involved with is with Han, Rute, Wu and Polu [HRW+21[HRW+21]Han, Jesse Michael; Rute, Jason; Wu, Yuhuai; et al.Proof Artifact Co-training for Theorem Proving with Language Models (2021)arXiv preprint arXiv:2102.06203https://arxiv.org/pdf/2102.06203] on training GPT3 [BMR+20[BMR+20]Brown, Tom B.; Mann, Benjamin; Ryder, Nick; et al.Language Models are Few-Shot Learners (2020)NeurIPShttps://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html] to predict Lean 3 tactics. The success of this approach strongly suggests that deep learning methods will play a critical role to the future of human-like automated reasoning. Through the use of statistical learning, the nuances of generating natural language and determining a precise criterion for what counts as 'human-like' can be avoided by simply providing a corpus of examples of human-like reasoning. Deep learning models are notorious for being data-hungry, and so there are still many questions remaining on how the data will be robustly extracted from our mathematical texts and formal proof archives. Perhaps few-shot techniques (see [BMR+20]) will help here. The research touched on above indicates that this method is not incompatible with also producing formalised proofs, although some care will need to be taken to be sure that the formal proofs and the human-readable accounts correspond correctly to each other.

## 7.1.3. How can this mode of human-like reasoning be presented to the user in an interactive, multimodal way?

Investigate new ways of interacting with proof objects. The result of working on this subgoal was the interactive expression engine of the ProofWidgets framework as discussed in Chapter 5. This system follows a long history of research on 'proof-by-pointing' starting with Bertot and Théry [BT98[BT98]Bertot, Yves; Théry, LaurentA generic approach to building user interfaces for theorem provers (1998)Journal of Symbolic Computationhttps://doi.org/10.1006/jsco.1997.0171], and my approach mainly follows similar work in other systems, for example, the system found in KeY [ABB+16[ABB+16]Ahrendt, Wolfgang; Beckert, Bernhard; Bubel, Richard; et al.Deductive Software Verification - The KeY Book (2016)publisher Springerhttps://doi.org/10.1007/978-3-319-49812-6]. My approach is unique in the coupling of the implementation of proof-by-pointing with the general purpose ProofWidgets framework.

Make it easier to create novel GUIs for interactive theorem provers. This was the primary mission of Chapter 5. As noted in Section 5.1, there are many existing GUI systems that are used to create user interfaces for interactive theorem proving. In Chapter 5 I contribute an alternative paradigm for creating user interfaces where the metalanguage of the prover itself is used to create proofs. The ProofWidgets system as implemented in Lean 3 is already in use today.

Produce an interactive interface for a human-like reasoning system. In Section 5.8, I connected the ProofWidgets framework to the Box datastructure developed in Chapter 3 to create an interactive, formalised human-like proof assistant. This serves as a prototype to achieve the research goal. There are many more implementation improvements that could be made and future directions are provided in Section 7.2.

I hope that this work will be viewed as a modern revival of the spirit and approach taken by the older, proof-planning-centric provers such as LΩUI for Ωmega [SHB+99[SHB+99]Siekmann, Jörg; Hess, Stephan; Benzmüller, Christoph; et al.LOUI: Lovely OMEGA user interface (1999)Formal Aspects of Computinghttps://doi.org/10.1007/s001650050053, BCF+97[BCF+97]Benzmüller, Christoph; Cheikhrouhou, Lassaad; Fehrer, Detlef; et al.Ωmega: Towards a mathematical assistant (1997)Automated Deduction - CADE-14https://doi.org/10.1007/3-540-63104-6_23] and XBarnacle for CLAM [LD97[LD97]Lowe, Helen; Duncan, DavidXBarnacle: Making Theorem Provers More Accessible (1997)14th International Conference on Automated Deductionhttps://doi.org/10.1007/3-540-63104-6_39]. This spirit was to make proof assistants accessible to a wider userbase through the use of multi-modal user interfaces that could represent the proofs in many different ways. I want to rekindle some of this optimism that better user interfaces can lead to wider adoption of ITP.

# 7.2. Future work and closing remarks

There are still many things that I want to do to HumanProof and to investigate the world of human-like automated reasoning. Some of the more technical and chapter-specific ideas for future work are covered in their respective chapters (Section 3.7, Section 4.6, Section 5.9, Section 6.8), in this section I restrict my attention to future research directions in a broader sense.

In this thesis, the purpose of the natural language generator was to demonstrate that the system was human-like. However, one question that arises from the evaluation study in Chapter 6 is whether natural language generation is useful for creating accessible ITPs. Participants were generally willing to learn to use a new syntax and language for mathematicsAlthough note that this may be due to sample selection bias (see Section 6.7)., which suggests that the main hurdle to adoption is not the use of a technical language. An additional focus group or study investigating whether natural language proofs play a role in easing the use of a theorem prover would be helpful in determining whether natural language generation of mathematics should be pursued in the future.

The evaluation showed that mathematicians value signpostingAs discussed in Section 6.6.1, signposting here means an indication of how the proof is going to progress without actually performing any inference, motivation and the right level of detail. I didn't focus on these aspects so much in the design of HumanProof. Is there a way of automating these more expository aspects of human-written proofs? The question of determining the right level of exposition has some subjective and audience-specific component to it, however I suspect that it is still possible to make some progress in this direction: the gap in comprehensibility between a human-written proof and a generated proof for any non-trivial example is undeniable. Rather than trying to build an 'expert system' of determining the right level of exposition, I think that the right appraoch is to use modern machine learning approaches as touched on in Section 7.1.2.

## 7.2.1. Closing remarks

HumanProof is a prototype and thus not production-ready software for day-to-day formalisation. I do hope that HumanProof will provide some ideas and inspiration in how the theorem provers of the future are designed.

# Bibliography for this chapter

• [ABB+16]Ahrendt, Wolfgang; Beckert, Bernhard; Bubel, Richard; Hähnle, Reiner; Schmitt, Peter H.; Ulbrich, MattiasDeductive Software Verification - The KeY Book (2016)publisher Springerdoi 10.1007/978-3-319-49812-6isbn 978-3-319-49811-9https://doi.org/10.1007/978-3-319-49812-6
• [BCF+97]Benzmüller, Christoph; Cheikhrouhou, Lassaad; Fehrer, Detlef; Fiedler, Armin; Huang, Xiaorong; Kerber, Manfred; Kohlhase, Michael; Konrad, Karsten; Meier, Andreas; Melis, Erica; Schaarschmidt, Wolf; Siekmann, Jörg H.; Sorge, VolkerΩmega: Towards a mathematical assistant (1997)Automated Deduction - CADE-14volume 1249pages 252--255editor McCune, Williampublisher Springerdoi 10.1007/3-540-63104-6_23https://doi.org/10.1007/3-540-63104-6_23
• [BG01]Bachmair, Leo; Ganzinger, HaraldResolution theorem proving (2001)Handbook of automated reasoningpages 19--99editors Robinson, J. A.; Voronkov, A.publisher Elsevierhttps://www.sciencedirect.com/book/9780444508133/handbook-of-automated-reasoning
• [Ble81]Bledsoe, Woodrow WNon-resolution theorem proving (1981)Readings in Artificial Intelligencepages 91--108editor Meltzer, Bernardpublisher Elsevierdoi 10.1016/0004-3702(77)90012-1https://doi.org/10.1016/0004-3702(77)90012-1
• [BMR+20]Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey; Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess, Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei, DarioLanguage Models are Few-Shot Learners (2020)NeurIPSeditors Larochelle, Hugo; Ranzato, Marc'Aurelio; Hadsell, Raia; et al.https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
• [BT98]Bertot, Yves; Théry, LaurentA generic approach to building user interfaces for theorem provers (1998)Journal of Symbolic Computationvolume 25number 2pages 161--194publisher Elsevierdoi 10.1006/jsco.1997.0171https://doi.org/10.1006/jsco.1997.0171
• [GG17]Ganesalingam, Mohan; Gowers, W. T.A fully automatic theorem prover with human-style output (2017)Journal of Automated Reasoningvolume 58number 2pages 253--291doi 10.1007/s10817-016-9377-1https://doi.org/10.1007/s10817-016-9377-1
• [HF97]Huang, Xiaorong; Fiedler, ArminProof Verbalization as an Application of NLG (1997)International Joint Conference on Artificial Intelligencepages 965--972http://ijcai.org/Proceedings/97-2/Papers/025.pdf
• [HRW+21]Han, Jesse Michael; Rute, Jason; Wu, Yuhuai; Ayers, Edward W; Polu, StanislasProof Artifact Co-training for Theorem Proving with Language Models (2021)arXiv preprint arXiv:2102.06203https://arxiv.org/pdf/2102.06203
• [LC20]Lample, Guillaume; Charton, FrançoisDeep Learning For Symbolic Mathematics (2020)ICLRpublisher OpenReview.nethttps://openreview.net/forum?id=S1eZYeHFDS
• [LD97]Lowe, Helen; Duncan, DavidXBarnacle: Making Theorem Provers More Accessible (1997)14th International Conference on Automated Deductionvolume 1249pages 404--407editor McCune, Williampublisher Springerdoi 10.1007/3-540-63104-6_39https://doi.org/10.1007/3-540-63104-6_39
• [LYWP21]Li, Wenda; Yu, Lei; Wu, Yuhuai; Paulson, Lawrence C.IsarStep: a Benchmark for High-level Mathematical Reasoning (2021)International Conference on Learning Representationshttps://openreview.net/forum?id=Pzj6fzU6wkj
• [McB00]McBride, ConorDependently typed functional programs and their proofs (2000)http://hdl.handle.net/1842/374
• [SHB+99]Siekmann, Jörg; Hess, Stephan; Benzmüller, Christoph; Cheikhrouhou, Lassaad; Fiedler, Armin; Horacek, Helmut; Kohlhase, Michael; Konrad, Karsten; Meier, Andreas; Melis, Erica; Pollet, Martin; Sorge, VolkerLOUI: Lovely OMEGA user interface (1999)Formal Aspects of Computingvolume 11number 3pages 326--342editor Woodcock, Jamespublisher Springerdoi 10.1007/s001650050053https://doi.org/10.1007/s001650050053
• [VSP+17]Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Lukasz; Polosukhin, IlliaAttention is All you Need (2017)Neural Information Processing Systemspages 5998--6008editors Guyon, Isabelle; von Luxburg, Ulrike; Bengio, Samy; et al.https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
• [Wen99]Wenzel, MarkusIsar - A Generic Interpretative Approach to Readable Formal Proof Documents (1999)Theorem Proving in Higher Order Logicsvolume 1690pages 167--184editors Bertot, Yves; Dowek, Gilles; Hirschowitz, André; et al.publisher Springerdoi 10.1007/3-540-48256-3_12https://doi.org/10.1007/3-540-48256-3_12