Stream: Announcements

Topic: Introduce Minilang and our machine learning toolkit


view this post on Zulip Qiyuan Xu (Dec 24 2025 at 15:37):

Merry Christmas, my dear colleagues in Isabelle.

Neural Theorem Proving (NTP) is an automated reasoning technique that applies LLMs or other neural models to generate proofs, offering a promising approach to free ITP users from manual proof construction. While Lean may appear to dominate NTP development in recent years, Isabelle actually led the field as recently as two years ago. When I asked researchers why they have shifted to Lean, they cited Isabelle's lack of adequate infrastructure for machine learning and data mining.

We believe this gap has now been addressed through our development of a comprehensive machine learning toolkit for Isabelle, which includes a REPL, precompiled AFP heap images, and a minimalist proof language specifically designed for machine learning. This language, _Minilang_, aims to reduce the conceptual complexity that LLMs must learn, thereby improving training efficiency. Experiments demonstrate that models trained with the most basic approach (supervised fine-tuning) can prove 79% of PISA problems (a set of randomly selected proof goals from the AFP) within 8 attempts. A translator from Isar to Minilang is also provided, which successfully translates 85% of the proofs in the AFP and the standard library of Isabelle/HOL. This work has been accepted at OOPSLA'26, and a pre-print is available on arXiv.
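For readers less familiar with the metric: "within 8 attempts" means a goal counts as proved if any of 8 independently sampled proofs checks. A minimal sketch of such an evaluation loop, with a hypothetical `try_prove` mock standing in for the actual model-plus-Isabelle pipeline (none of these names come from our toolkit):

```python
import random

def try_prove(goal: str, attempt: int) -> bool:
    """Hypothetical stand-in for one proof attempt; a real pipeline would
    sample a Minilang proof from the model and check it in Isabelle."""
    random.seed(f"{goal}/{attempt}")  # deterministic mock, for illustration only
    return random.random() < 0.3      # pretend per-attempt success rate

def solved_within(goal: str, k: int = 8) -> bool:
    # A goal counts as solved if any of k independent attempts succeeds.
    return any(try_prove(goal, a) for a in range(k))

goals = [f"goal_{i}" for i in range(100)]
rate = sum(solved_within(g) for g in goals) / len(goals)
print(f"solved within 8 attempts: {rate:.0%}")
```

The per-attempt success probability and goal names above are made up; only the shape of the loop matters.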

While Lean currently maintains its lead in AI for mathematical competitions, the landscape differs for practical proof engineering tasks (e.g., those from Isabelle's AFP and Lean's mathlib). On LeanDojo, a Lean counterpart of PISA, the best reported result is 60% (by Alchemy), while Minilang's most basic model (plain SFT with a naive prompt-context setting) reaches 79% on PISA. Although these results come from different benchmarks and are not directly comparable, Minilang still positions Isabelle once again at the forefront of AI-driven theorem proving for practical proof engineering.

We envision Minilang as a foundational component of the AI toolkit ecosystem for Isabelle. To this end, we have open-sourced all code and data (see the end of our paper) and are committed to providing full technical support to users of Minilang and our toolkit.

Our team will continue developing AI provers and tools for Isabelle, with the ultimate goal of providing production-ready, push-button solutions. We will post updates on our progress.

Long live Isabelle, and may it continue to prosper.
Qiyuan

view this post on Zulip Qiyuan Xu (Jan 12 2026 at 16:05):

@Mario Xerxes Castelán Castro your opinion is welcome :)

view this post on Zulip Yosuke Ito (Jan 14 2026 at 01:21):

@Qiyuan Xu
I'm not an expert on AI, but your research is interesting to me.

view this post on Zulip Mario Xerxes Castelán Castro (Jan 15 2026 at 01:23):

Qiyuan Xu said:

Mario Xerxes Castelán Castro opinion is welcome :)

view this post on Zulip Qiyuan Xu (Jan 23 2026 at 18:30):

@Mario Xerxes Castelán Castro, I will respond to you later, but let me share a progress update first.

The recent trend in AI4Math is to use non-finetuned agents, and Numina Agent has hit SOTA on PutnamBench.
This could be a great opportunity for Isabelle. Previously, a major pain point for me was the lack of math-competition data for Isabelle: the equivalent datasets for Lean were created by hiring experts at a cost of millions of dollars (based on what I've gathered from some AI companies). Without such data, it was nearly impossible to train LLMs on math-competition problems in Isabelle.

But now, agents built on general-purpose LLMs (Claude, Gemini) without any fine-tuning have surprisingly achieved SOTA, and Numina Agent is open-source. This means I can directly adapt and improve its implementation for Minilang. Since Minilang has better infrastructure than Lean (as an Isabelle hacker, I am confident in my work), the agent should perform even better there. This might just correct some misconceptions held by certain overconfident folks in the AI4Math community.

I'm currently visiting MBZUAI and leveraging their nearly unlimited CPU clusters and GPUs for data extraction and training. I hope to share more good news soon.
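To make the "non-finetuned agent" idea concrete, here is a minimal sketch of the basic feedback loop such agents use: propose a proof with an off-the-shelf LLM, check it, feed the checker's error back, and retry. Both `llm_complete` and `isabelle_check` below are hypothetical mocks, not Numina Agent's or our toolkit's actual API:

```python
def llm_complete(prompt: str) -> str:
    # Hypothetical stand-in for a general-purpose LLM call (e.g. Claude,
    # Gemini); here it just cycles through canned candidate proofs.
    candidates = ["by auto", "by simp", "by blast"]
    return candidates[prompt.count("Attempt failed") % len(candidates)]

def isabelle_check(goal: str, proof: str):
    # Hypothetical stand-in for replaying the proof in an Isabelle REPL.
    ok = proof == "by simp"
    return ok, "" if ok else f"{proof!r} did not close the goal"

def prove(goal: str, max_rounds: int = 4):
    """Agent loop: ask for a proof, check it, append the failure message
    to the context, and retry up to max_rounds times."""
    transcript = [f"Goal: {goal}"]
    for _ in range(max_rounds):
        attempt = llm_complete("\n".join(transcript))
        ok, feedback = isabelle_check(goal, attempt)
        if ok:
            return attempt
        transcript.append(f"Attempt failed: {feedback}")
    return None
```

A real agent adds retrieval, sketching, and richer error parsing on top, but this loop is the core shape, and it needs no fine-tuning at all.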


Last updated: Feb 16 2026 at 08:52 UTC