Stream: Archive Mirror: Isabelle Users Mailing List

Topic: [isabelle] Minimizer and preplay - 2 of the many amazing ...


view this post on Zulip Email Gateway (Aug 19 2022 at 08:40):

From: Gottfried Barrow <gottfried.barrow@gmx.com>
Hi,

I've spent a lot of time studying how Sledgehammer works with the 25 or
so ATPs that can be enabled.

About the time I decide I'm spending too much time experimenting with
it, I get some new understanding of its feedback in the output panel,
and learn how to use another of its many amazing features.

So I used the "dont_preplay" option, because Sledgehammer was taking
time to replay the proofs the ATPs had found, to test them for its own
purposes, and since the proofs it found were all working, I decided it
didn't need to test them.

But there was this one metis proof that stayed purple for an annoyingly
long time, 2 or 3 seconds, as Isabelle clicked through the proofs of my
theory. That's a long time when every other proof takes only half a
second or less.

The short story is that, not understanding the timing information
Sledgehammer was giving me due to preplay, I had picked the 3 second
proof instead of one of the half second proofs it had offered.

Eventually, I discovered the use of the timing information due to
preplay. Does Sledgehammer return 10 different metis proofs? Now, I
don't just pick the one using the shortest named facts, I look at how
long Sledgehammer took to preplay the proof.

Okay, I've learned to look at more than that. Are there ten proofs
offered? I look at the theorems metis uses and ask, "If I were going to
do a detailed proof using the theorems metis uses, which set of facts
would be the most logically appealing, and require the fewest steps?"

But due to parallel processing and the timeout setting, there are lots
of differences in the proofs Sledgehammer will find.

Sledgehammer was returning 10 or so metis proofs that were using
numerous facts, and it was offering to minimize them, which is when
Sledgehammer feeds back a proof to the ATP using fewer and fewer facts.

I said, "Hey, manually going through 7 minimize attempts is not what I
call automation, so I'll set 'minimize=true'". But there's always a
trade-off.

After going back and running Sledgehammer on theorems with the preplay
option on, for one theorem it found only one proof, rather than the
usual many, and it was remote_z3_tptp that found it.

This brings up the issue of nuance, one being that, given a certain
timeout setting, sometimes a remote duplicate of a local prover will
return a proof that the local prover doesn't.

I had also started asking the question, "Do I really need z3_tptp in
there?" Well, obviously, yes.

Since I wasn't getting lots of proofs, I disabled "minimize" and used
"dont_preplay". I then got proofs for z3_tptp, remote_z3_tptp, z3, and
remote_z3. The time spent for the 4 provers was like this:

remote_z3_tptp: ATP real CPU time: 2.15 s.
z3_tptp: ATP real CPU time: 29.79 s.
z3: SMT solver real CPU time: 2.22 s.
remote_z3: SMT solver real CPU time: 5.75 s.

More nuances. Sometimes the remote prover is faster than the local
prover, and sometimes not. Last week, I didn't use remote provers that
are duplicates of local provers, but now I generally do.

There's more for me to tell about how to use Sledgehammer's information
than most people have time to read.

For example, z3_tptp gave me a proof like this:

by (metis extension_equality unordered_pair_axiom ordered_pair_def)
(406 ms)

My extension_equality theorem is merely a biconditional of my
extension_axiom. I was asking these questions, "Do I really need
extension_equality? Will the proof work using extension_axiom instead,
and will it be slower if it does?"

Well, the Sledgehammer output shows you how to use the "min" option, so
I tried this, not to minimize the number of facts, but to get the timing
information with the axiom substituted for the theorem:

sledgehammer min [z3_tptp, timeout = 90, verbose] (extension_axiom
unordered_pair_axiom ordered_pair_def),

and I got this:

Try this: by (metis extension_axiom unordered_pair_axiom
ordered_pair_def) (408 ms).

So metis only uses 2 more milliseconds when it uses the axiom directly.

I finish out most of this email with my maxed out sledgehammer
command, and the reasoning behind it.

With Sledgehammer, you need a good process viewer to see what it's
doing, and a good CPU meter in your task bar to see whether your CPUs
are running heavily when they shouldn't be. If you don't do things
right, processes taking up 25% of your CPU won't get terminated. Doing
things right means letting Sledgehammer finish normally, or deleting
the command before you close jEdit. Sometimes processes just don't get
terminated when you don't let Sledgehammer finish.

Win CPU meter: http://www.hexagora.com/en_dw_davperf.asp
Win Process Explorer:
http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx

My long version is this, and I explain what's not arbitrary about it:

sledgehammer[DEL,dont_preplay,minimize,timeout=90,verbose,max_relevant=smart,provers="
metis remote_vampire e remote_z3 spass_new remote_e_sine z3_tptp
remote_e_tofof remote_e remote_z3_tptp remote_cvc3 remote_iprover
remote_iprover_eq remote_waldmeister satallax z3 cvc3 yices"]

1) I use a macro to insert it, and I put the DEL in to cause an error so
it won't run. The "DEL,dont_preplay," is highlighted, and I press the
backspace key to delete it, after which Sledgehammer starts running.

2) I want "preplay" on to get the timing information, but I could
decide not to use it, so that Sledgehammer devotes all its time to
finding proofs, in case it has been timing out and not finding many.

3) I normally want to automate "minimize", but if it's timing out and
not getting many proofs, I'll delete it before I run Sledgehammer.

4) I could set "max_relevant=1000" so that Sledgehammer feeds the ATPs
a massive number of facts, but I've learned that if the ATPs have to
work too hard, then either I'm trying to prove a false statement, or I
need to add some preliminary lemmas. Quite possibly, a proof that takes
an ATP a long time to find will also make the Isabelle prover work far
too hard and long once it's given the proof.

5) Only a maximum of 4 provers will run in parallel, and they run in the
order given, so I put the 7 better provers first, and I have a
reasonable timeout set so that the first 4 provers don't hold everything
else up too long. Many times the first four will finish fast anyway,
and again, if a particular prover isn't proving anything reasonably
fast, then either it won't find a proof at all, or it won't find a
decently fast one.

6) metis is first, because though it can get overloaded and kick out
fast, it can also return proofs very fast; metis is run directly by
poly.exe.

7) The vampire binary is hard to get, the Windows version doesn't run
right, and remote_vampire is twice as fast as my Linux vampire, but
vampire is one of the best of the best, so I put it second.

8) "e" is an excellent prover, so I put the local e in the first group.
I also put remote_e in with the remote group of provers. Processing
nuances could result in remote_e returning a proof that the local e
doesn't. Local provers can fail to return proofs that remote provers
return because parallel processing is never the same, especially when
the options and provers listed are different. But I try these 20 or so
provers on the first try to save time; different provers find proofs
for different theorems, and not always because the others weren't given
enough time. If I don't get such good results on this massive try, then
I can start tweaking options.

9) remote_z3 is before z3 because z3, when used directly, won't run in
parallel. It will only run after all other provers listed before it have
finished. However, it's an excellent prover, which is why I want
remote_z3 to start running in the first 4. Additionally, I leave out the
"smt" prover; smt calls z3, and z3 used directly finds better proofs,
and finds them just as fast.

10) "z3_tptp" uses the z3-prover.exe just like z3, but it's not used the
same, so it finds proofs when the others don't, as shown by my example
above.

11) I then let Sledgehammer go through all the other remote provers,
other than remote_leo2 and remote_satallax. These are HOL provers, and
I'm almost exclusively doing FOL. I wonder about leaving remote_iprover
and remote_iprover_eq in. They don't prove a lot, but I leave them in.

12) I then do the local satallax. Maybe it's faster or slower at
finishing than the remote_satallax, but remote provers can sometimes
hold things up; a 60 second timeout might get processed remotely over a
period of much more than 60 seconds.

13) z3, cvc3, and yices won't run in parallel with other provers, so I
put them last, not that it would matter. If I put them first, they still
won't start until the other provers are finished. I put z3 first,
because they will run in the order listed.

Thanks to Larry Paulson and Jasmin Blanchette for writing an amazing
piece of software. Jasmin has helped me out a lot, and I give you some
links to articles Jasmin gave me. He says,

For many of my papers and in my Ph.D. thesis, I've evaluated E, iProver, SPASS, Vampire, CVC3, Yices, Z3, LEO-II, and Satallax. See e.g.

http://www21.in.tum.de/~blanchet/jar-smt.pdf (Section 6)
http://www21.in.tum.de/~blanchet/testbench.pdf (Section 5)
http://www21.in.tum.de/~blanchet/enc_poly_paper.pdf (Section 6)
http://www21.in.tum.de/~blanchet/itp2012-spass.pdf (Section 7)
http://www21.in.tum.de/~blanchet/iwil2010-sledgehammer.pdf (Section 3)

and of course the original "Judgement Day" paper by Böhme & Nipkow:

http://www21.in.tum.de/~boehmes/judgement/judgement.pdf

Hence I have a very good idea, based on thousands of data points, of how
[message truncated]


Last updated: Nov 21 2024 at 12:39 UTC