Hello,
Last week my Opus 4.6 setup found proofs of false in Rocq and Lean (see blog post).
@Lukas Stevens sent me overview.md describing areas of suspected vulnerabilities in Isabelle, and I've asked Opus to find bugs and potentially exploit them into proofs of false in the Isabelle codebase.
Here is what it found so far: REPORT.md (Using Isabelle2025-2)
Section 1 can be ignored as this was already known (KuncarPopescu_Exploit.thy) but there may be interesting bugs in Sections 3 and 2.
I still have access to the agent, let me know if it should be steered in one direction or another :)
Thanks for looking into this. 3.3 could be a one place where it might be worth to try harder. If dependencies really can be lost, then potentially the KuncarPopescu example can be resurrected. If theory merges are to blame, than any counterexample would need to consist of multiple theories (probably with a diamond-shaped import structure).
With pleasure! OK I will steer it in that direction (currently stucked on an anthropic login bug)
Capture d’écran 2026-03-11 à 20.54.15.png
Here is its updated report: REPORT_UPDATED.md it may contain new leads
Good for the merges! The only place I am aware of where unchecked definitions are explicitly used is Nominal1 (src/HOL/Nominal/Nominal.thy). (In contrast Nominal2 in the AFP is fully foundational.) It might makes sense to try to make sense of these particular unchecked definitions (and the checked ones building on top) and see if they can give rise to any problems.
I forwarded your message and it updated its report but does not seem to have found anything new:
Capture d’écran 2026-03-12 à 11.32.44.png
REPORT.md
Hi Tristan,
thanks for all your efforts! We from the Copilots for Isabelle project are currently out of ideas on what other details to look out for. If time and tokens permit, you could try to have a go with less specific input to avoid "overfitting", so you might just point it to the primitives in thm.ML, overloading and typedef.
Yes of course! Launching now
![]()
![]()
It did not try very hard but maybe enough that there is clear direction to nudge it into?
Yes, that doesn't look very promising. I don't have any specific ideas. I am wondering whether it generated any code to test these claims, or if they just come from considering the implementation?
It generated 28 test files:
isabelle_tests_archive.tar.gz
Last updated: Mar 20 2026 at 05:16 UTC