Stream: General

Topic: Looking for proofs of false, Isabelle edition


view this post on Zulip Tristan Stérin (Mar 11 2026 at 16:28):

Hello,

Last week my Opus 4.6 setup found proofs of false in Rocq and Lean (see blog post).

@Lukas Stevens sent me overview.md describing areas of suspected vulnerabilities in Isabelle, and I've asked Opus to find bugs and potentially exploit them into proofs of false in the Isabelle codebase.

Here is what it found so far: REPORT.md (Using Isabelle2025-2)

Section 1 can be ignored as this was already known (KuncarPopescu_Exploit.thy) but there may be interesting bugs in Sections 3 and 2.

I still have access to the agent, let me know if it should be steered in one direction or another :)

view this post on Zulip Dmitriy Traytel (Mar 11 2026 at 16:47):

Thanks for looking into this. 3.3 could be a one place where it might be worth to try harder. If dependencies really can be lost, then potentially the KuncarPopescu example can be resurrected. If theory merges are to blame, than any counterexample would need to consist of multiple theories (probably with a diamond-shaped import structure).

view this post on Zulip Tristan Stérin (Mar 11 2026 at 16:59):

With pleasure! OK I will steer it in that direction (currently stucked on an anthropic login bug)

view this post on Zulip Tristan Stérin (Mar 11 2026 at 19:55):

Capture d’écran 2026-03-11 à 20.54.15.png

Here is its updated report: REPORT_UPDATED.md it may contain new leads

view this post on Zulip Dmitriy Traytel (Mar 12 2026 at 07:45):

Good for the merges! The only place I am aware of where unchecked definitions are explicitly used is Nominal1 (src/HOL/Nominal/Nominal.thy). (In contrast Nominal2 in the AFP is fully foundational.) It might makes sense to try to make sense of these particular unchecked definitions (and the checked ones building on top) and see if they can give rise to any problems.

view this post on Zulip Tristan Stérin (Mar 12 2026 at 10:34):

I forwarded your message and it updated its report but does not seem to have found anything new:
Capture d’écran 2026-03-12 à 11.32.44.png
REPORT.md

view this post on Zulip Lukas Stevens (Mar 13 2026 at 16:44):

Hi Tristan,
thanks for all your efforts! We from the Copilots for Isabelle project are currently out of ideas on what other details to look out for. If time and tokens permit, you could try to have a go with less specific input to avoid "overfitting", so you might just point it to the primitives in thm.ML, overloading and typedef.

view this post on Zulip Tristan Stérin (Mar 13 2026 at 16:52):

Yes of course! Launching now

view this post on Zulip Tristan Stérin (Mar 13 2026 at 17:13):

Capture d’écran 2026-03-13 à 18.12.43.png
Capture d’écran 2026-03-13 à 18.13.00.png

It did not try very hard but maybe enough that there is clear direction to nudge it into?

view this post on Zulip Lukas Stevens (Mar 17 2026 at 12:21):

Yes, that doesn't look very promising. I don't have any specific ideas. I am wondering whether it generated any code to test these claims, or if they just come from considering the implementation?

view this post on Zulip Tristan Stérin (Mar 17 2026 at 12:32):

It generated 28 test files:
isabelle_tests_archive.tar.gz


Last updated: Mar 20 2026 at 05:16 UTC