Stream: Mirror: Isabelle Users Mailing List

Topic: [isabelle] cvc4 prover always crashing


view this post on Zulip Email Gateway (Sep 02 2021 at 22:29):

From: "Dr A. Koutsoukou-Argyraki" <ak2110@cam.ac.uk>
Hi, I've noticed that lately every single time
I use Sledgehammer, cvc4 always returns the message "the prover
crashed".
e, z3, vampire either give me proofs or time out.

cvc4 simply never works and if my memory serves me well this
issue must have started with Isabelle 2021.

Any insights?

Many thanks
Best wishes,
Angeliki

view this post on Zulip Email Gateway (Sep 03 2021 at 00:15):

From: "Eugene W. Stark" <isabelle-users@starkeffect.com>
From my point of view, it is definitely the case that the Sledgehammer provers crash more frequently
and produce failed proofs more frequently in Isabelle2021 than they did in Isabelle2020. However,
taking the whole set of tools in Sledgehammer as an ensemble, much more amazing results are produced
in Isabelle2021 than in previous versions. The SMT provers can routinely produce "one-liners" that
replace tens of lines (or more) of written-out Isar code. (Of course, you have to throw away all the
failed proofs and take the ones that work.) This capability makes Sledgehammer extremely useful in
an approach to proof development where the human prover can make proposals for "big steps" in a proof,
with the expectation that if a proposed assertion is in fact true, then an SMT prover driven from'
Sledgehammer has a very good chance of being able to verify this.

My personal experience is that SMT-based proofs tend to be rather more fragile with respect to changes
in the underlying definitions that proofs based on the less-powerful methods, so once a proof has
been sketched out and verified with the help of Sledgehammer, it is well worth the extra effort to go
through and eliminate the uses of SMT and in the process identify reasoning patterns that can be
codified in simplification rules, etc. Nevertheless it is quite a boon to be able to verify a proof plan
in "big chunks" and then come back and refine it, knowing already that it has been verified that
the proof is correct in outline.

For me, the most annoying new behavior in Sledgehammer in Isabelle2021 is the tendency in certain
situations to claim to have found a proof and then to spit out Isar code that doesn't even "show"
the lemma that was supposed to have been proved. This is apparently some kind of issue in replaying
in the proof kernel the results of the SMT searches, but as I have very little understanding of
how that all works, I will stop there. Would it be so difficult to ensure that any proposed Isar
proofs really do "show" the lemma that they are supposed to be showing? I tried early on after
the release of Isabelle2021 to capture examples of this, but as it didn't seem like it was helping,
I gave up on that for the moment.

I also frequently see failures of the SMT solvers with segmentation faults. These no doubt
represent programming errors in the SMT solvers. The wisdom of an LCF-style system with a
"trusted kernel" proves itself over and over again by making it possible to accomplish something
useful with very powerful, but potentially buggy, tools.

I find it generally very difficult "in the heat of the moment" when working on a theory
of several thousand lines that depend on a context of tens of thousands lines of other theories,
to capture useful repeatable instances of the various ways in which Sledgehammer fails.
The randomness in the Sledgehammer procedures, which I think probably derives not only from
randomness in the SMT solvers themselves, but also from fact caching mechanisms that at
work in Isabelle, makes repeatability problematic in many cases. I would be interested to hear
from the developers any advice as to how I might be able to provide them with useful failure
cases to help reduce the number of bugs overall.

Finally, the timing figures produced by Sledgehammer currently have very little connection to
reality, in my opinion. I generally select a particular proposed proof based on the length and
content of the list of facts used, rather than paying too much attention to the timing figures.
The discrepancy between the timings reported by Sledgehammer when the proof is proposed and
the time available from Isabelle/JEdit seems to have become larger with Isabelle2021, however
at around the time of the release of Isabelle2021 I started to use a 10-core processor for
development, so it is possible that the increased discrepancies have something to do with tallying
the time used by the larger number of threads.

- Gene Stark


Last updated: Jan 04 2025 at 20:18 UTC