Stream: Mirror: Isabelle Users Mailing List

Topic: [isabelle] Introducing AutoReal: A Cost-Efficient Local L...


Email Gateway (Mar 15 2026 at 12:42):

From: Jianyu Zhang <jianyu_zhang@zju.edu.cn>

Dear Isabelle users,

We have been following the recent discussion on LLM-supported proof generation
in Isabelle, especially the questions around practical usefulness,
computational cost, and how such systems should relate to established tools
such as Sledgehammer and other automation methods. In that context, we thought
it might be useful to briefly introduce our recent work, AutoReal, from the
research group of Prof. Yongwang Zhao at the College of Computer Science and
Technology, Zhejiang University, China, and we would be very grateful for any
feedback from the community.

AutoReal is an LLM-driven prover built around a locally deployable 7B model
(based on Qwen2.5-Coder). One motivation behind our work is exactly the
concern raised in the earlier discussion: frontier proprietary models such as
GPT-5 and Claude can be very powerful, but they are also expensive, opaque,
and difficult to reproduce in everyday theorem-proving workflows. By contrast,
AutoReal is designed to run locally with a much lower deployment barrier and
operating cost. In our experiments, AutoReal runs efficiently on a single
laptop GPU with 24 GB of VRAM for standard inference.
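As a rough sanity check on the 24 GB figure, here is a back-of-the-envelope memory estimate (our own sketch, not from the paper; the exact footprint depends on quantization, KV-cache size, and runtime overhead):

```python
def fp16_weight_gib(n_params: float) -> float:
    """Approximate GiB needed to hold model weights in fp16 (2 bytes per parameter)."""
    return n_params * 2 / 2**30

weights = fp16_weight_gib(7e9)  # a 7B model such as one based on Qwen2.5-Coder
headroom = 24 - weights         # left over for KV cache, activations, runtime overhead
print(f"fp16 weights: {weights:.1f} GiB, headroom on a 24 GiB GPU: {headroom:.1f} GiB")
```

So a 7B model in fp16 occupies roughly 13 GiB, leaving around 11 GiB of headroom on a 24 GB card, which is consistent with the laptop-GPU deployment described above.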

A second point is scope. Our target is not primarily olympiad-style
mathematics or small synthetic benchmarks, but proof obligations arising in
real verification projects. In particular, we focus on industrial-scale
verification settings such as seL4. Given the lemma statement and proof
context (including the relevant lemmas and definitions needed for the proof),
AutoReal achieves a proof success rate of 51.67% (341/660) on 660 theorems
drawn from key theories of seL4, covering all major proof categories. We
also evaluated AutoReal under the same setting on three
security-related AFP developments — CRYSTALS-Kyber_Security, RSAPSS, and
Elliptic_Curves_Group_Law — where it achieves an overall proof success rate of
53.88%.
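To make the evaluation setting concrete, the input described above (a lemma statement plus the relevant lemmas and definitions) could be assembled into a single prompt roughly as follows. This is purely our own illustrative sketch; AutoReal's actual prompt format is described in the paper, and the lemma names here are invented:

```python
def build_prompt(statement: str, context: list[str]) -> str:
    """Combine a lemma statement with its proof context into one prompt string.

    `context` holds the pretty-printed lemmas/definitions the proof may depend
    on. The layout is illustrative only, not AutoReal's actual format.
    """
    parts = ["(* Proof context *)"]
    parts += context
    parts += ["(* Goal *)", statement]
    return "\n".join(parts)

# Invented example in the style of an Isabelle lemma:
prompt = build_prompt(
    'lemma rev_rev: "rev (rev xs) = xs"',
    ['lemma rev_append: "rev (xs @ ys) = rev ys @ rev xs"'],
)
```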

Another aspect we have found useful in practice is that AutoReal outputs not
only candidate proof steps, but also a natural-language explanation for each
step, describing what the step does and how it affects the proof state. Our
hope is that this makes LLM-generated proofs easier for human verifiers to
inspect, understand, and repair.
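For illustration, pairing each generated step with its explanation might yield a reviewer-facing record like the one below. The record and rendering format are hypothetical (the paper describes AutoReal's actual output); only the idea of one explanation per step comes from the text above:

```python
from dataclasses import dataclass

@dataclass
class ProofStep:
    """One candidate proof step plus the model's natural-language explanation."""
    tactic: str       # the Isabelle method text, e.g. 'apply (induct xs)'
    explanation: str  # what the step does and how it affects the proof state

def render(steps: list[ProofStep]) -> str:
    """Interleave explanations and steps as Isabelle comments for human review."""
    lines = []
    for s in steps:
        lines.append(f"(* {s.explanation} *)")
        lines.append(s.tactic)
    return "\n".join(lines)

out = render([
    ProofStep("apply (induct xs)",
              "Induct on xs, splitting the goal into Nil and Cons cases."),
    ProofStep("apply auto",
              "Both cases follow by simplification with the rev simp rules."),
])
```

Rendering explanations as comments next to the steps keeps the generated text in a form a human verifier can paste into a theory file and inspect directly.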

We are also working on integrating AutoReal into Isabelle, with the hope that
it can eventually be used in a way similar to Sledgehammer. The integration is
still incomplete, and one of the main remaining challenges is extracting the
relevant proof context automatically.

If this sounds interesting, we would be very happy for members of the
community to take a look at the paper:

Paper:
https://doi.org/10.48550/arXiv.2602.08384

We have also open-sourced the weights of our fine-tuned model here:
https://doi.org/10.5281/zenodo.18412012

We see AutoReal as complementary to existing tools rather than a replacement
for them, and we would sincerely welcome comments, criticism, suggestions, or
possible collaboration. In particular, we would be very interested in feedback
on Isabelle-side integration, context extraction, evaluation settings, and use
cases that matter to practitioners.

Best regards,

Jianyu Zhang
Research group of Prof. Yongwang Zhao
College of Computer Science and Technology, Zhejiang University, China
2026-03-15


Last updated: Mar 21 2026 at 20:30 UTC