Stream: Mirror: Isabelle Users Mailing List

Topic: [isabelle] Heavy memory use with sledgehammer


view this post on Zulip Email Gateway (Apr 26 2023 at 14:03):

From: Ghilain Bergeron <ghilain.bergeron@inria.fr>
Hello.

I am having a bit of a problem with sledgehammer.
It seems that sometimes the memory it uses isn't being reclaimed (through GC, I guess).
This leads to very high memory usage (e.g. a polyml process using 16 GB of RAM), and the
longer this goes on, the more likely it is that sledgehammer will crash the LSP server
(note: it also happens within the JEdit interface).
Does anyone have any idea how to try and debug this? Or is this expected behavior when running
sledgehammer often?

Thanks in advance,
Ghilain.

view this post on Zulip Email Gateway (Apr 26 2023 at 14:12):

From: Dominic Mulligan <cl-isabelle-users@lists.cam.ac.uk>
Hi,

To add another observation: I, along with multiple colleagues, have also
noticed a similar and perhaps related effect with Isabelle2022 on Apple M1
Macs. After prolonged use, Isabelle performance (including that of
Sledgehammer) seems to degrade significantly, to the point where it struggles to
find proofs for anything and even invoking the external provers seems to
take an age. Killing Isabelle and restarting seems to fix the problem.

Thanks,
Dominic

view this post on Zulip Email Gateway (Apr 26 2023 at 14:20):

From: Jasmin Blanchette <jasmin.blanchette@ifi.lmu.de>
Dear Ghilain,

This is definitely not the intended behavior. I suspect the issue might be related to your platform. Are you using Windows? Also, do you know which prover(s) this happens with, e.g. by investigating the list of running processes?

In principle, provers are invoked with a soft and a hard timeout, so they should terminate, but there have been reports before of provers that keep on running.
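
For illustration only, the (soft) timeout that users see can be adjusted per invocation, roughly like this (just a sketch; the default is 30 seconds):

sledgehammer [timeout = 60]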

Best,
Jasmin

view this post on Zulip Email Gateway (Apr 26 2023 at 14:28):

From: Ghilain Bergeron <ghilain.bergeron@inria.fr>
Hi Jasmin!

I am using NixOS version 22.11, with Isabelle2022. There does not seem to be any prover left running
when sledgehammer finishes, according to my system monitor.
However, all I can see is a "poly" process slowly filling up my RAM each time I call sledgehammer, although
from time to time there seems to be a GC run (not frequently at all; currently the process takes 13.5GB of my RAM
and I ran sledgehammer like 5 times on a fairly complex goal).
From what I can see, here's the list of provers running on my machine: z3, cvc4, SPASS, veriT, vampire,
eprover-ho and zipperposition.

I have also noticed the issue that Dominic Mulligan reports (in a follow-up mail), regarding degrading performance
after prolonged Isabelle use. I'm not sure whether this is relevant to this problem, though.

Regards,
Ghilain

view this post on Zulip Email Gateway (Apr 26 2023 at 15:38):

From: "Eugene W. Stark" <isabelle-users@starkeffect.com>
To add to this: it is my experience and belief that, under certain modes of use, Sledgehammer causes
Poly/ML to leak heap space, resulting in the heap utilization eventually bumping up against the limit
and causing frequent garbage collection and poor performance. At that point, restarting Isabelle is
necessary to restore performance.

This occurs for me on Linux (Ubuntu 22.04). It might be related to the long-standing issue with some
of the external provers not handling signals correctly and popping up notifications that the external
program has crashed.

- Gene Stark

view this post on Zulip Email Gateway (Apr 27 2023 at 07:09):

From: Jasmin Blanchette <jasmin.blanchette@ifi.lmu.de>
I suspect the issues are related to the Poly/ML platform or to the Isabelle/ML infrastructure. Perhaps Makarius can comment on this thread?

Best,
Jasmin

view this post on Zulip Email Gateway (Apr 27 2023 at 09:29):

From: Makarius <makarius@sketis.net>
We need something reproducible to work with.

Generally note that ARM64 is still not fully supported: on macOS, Intel
binaries work out of the box, but I don't know if and how the implicit
translation via Rosetta 2 can interfere with run times and interruptibility.

Makarius

view this post on Zulip Email Gateway (Apr 27 2023 at 09:32):

From: Makarius <makarius@sketis.net>
On 26/04/2023 16:28, Ghilain Bergeron wrote:

I am using NixOS version 22.11, with Isabelle2022. There does not seem to be
any prover left running
when sledgehammer finishes, according to my system monitor.

NixOS is not on the list of officially supported Isabelle platforms, see also
https://isabelle.in.tum.de/installation.html

So it must be a package provided privately by someone else (who should also
take responsibility for it).

Can you provide a source for that? I would be interested to see how it was
done, to say whether there is a chance of it working properly.

However, all I can see is a "poly" process slowly filling up my RAM each time
I call sledgehammer, although
from time to time there seems to be a GC run (not frequently at all; currently
the process takes 13.5GB of my RAM
and I ran sledgehammer like 5 times on a fairly complex goal).

Going towards 16 GB is always critical, because ML GC times are getting long
and can interfere with the interaction. 16 GB is a hard limit of the default
Poly/ML platform.

It might be better to set ML_system_64 = "true" in
$ISABELLE_HOME_USER/etc/preferences (while the PIDE is not running) and try
with a machine that has 64 GB, or so.
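
For reference, this is a single line in that file, e.g. (a sketch, assuming no other entries need to change):

ML_system_64 = "true"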

Anyway, are these proof problems somewhere publicly available, e.g. on
Isabelle/AFP?

Makarius

view this post on Zulip Email Gateway (Apr 27 2023 at 09:41):

From: Makarius <makarius@sketis.net>
On 26/04/2023 17:30, Eugene W. Stark wrote:

To add to this: it is my experience and belief that, under certain modes of
use, Sledgehammer causes
Poly/ML to leak heap space, resulting in the heap utilization eventually
bumping up against the limit
and causing frequent garbage collection and poor performance.  At that point,
restarting Isabelle is
necessary to restore performance.

As you have plenty of public examples on Isabelle/AFP, can you find a
situation where it is somehow reproducible?

We also need a specification of the hardware resources (RAM, CPU cores).

It might be related to the
long-standing issue with some
of the external provers not handling signals correctly and popping up
notifications that the external
program has crashed.

What exactly is the signal / return code of the crash?

Maybe this is just an out-of-memory situation in Linux.

Makarius

view this post on Zulip Email Gateway (Apr 27 2023 at 09:44):

From: Ghilain Bergeron <ghilain.bergeron@inria.fr>
The original package for Isabelle2021 is on nixpkgs: https://github.com/NixOS/nixpkgs/blob/nixos-22.11/pkgs/applications/science/logic/isabelle/default.nix
However, I have modified it a bit for Isabelle2022: https://github.com/Mesabloo/nix-config/blob/master/extra/nix-overlays/packages/isabelle.nix
Note that I am not using Poly/ML from nixpkgs either (https://github.com/Mesabloo/nix-config/blob/master/extra/nix-overlays/packages/polyml.nix), as it lacks the option "--enable-intinf-as-int", without which Isabelle cannot be built. Perhaps I am missing some other configuration flag?

My "custom" Nix script for Isabelle uses the Isabelle installation from https://github.com/m-fleury/isabelle-emacs, although this should change close to nothing regarding Isabelle itself, as it is mainly a mirror plus some modifications to the LSP server and an LSP mode for Emacs.

It might be better to set ML_system_64 = "true" in
$ISABELLE_HOME_USER/etc/preferences (while the PIDE is not running) and try
with a machine that has 64 GB, or so.

Unfortunately, I don't have access to such a machine, only 32 GB of RAM.

Anyway, are these proof problems somewhere publicly available, e.g. on
Isabelle/AFP?

Currently they aren't, and I don't know when they will be. However, it shouldn't matter much, as long as the project is not very small and the proofs are not easy for sledgehammer to find.

Cheers,
Ghilain.

De: "Makarius" <makarius@sketis.net>
À: "Ghilain Bergeron" <ghilain.bergeron@inria.fr>, "Jasmin Blanchette" <jasmin.blanchette@ifi.lmu.de>
Cc: "cl-isabelle-users" <cl-isabelle-users@lists.cam.ac.uk>
Envoyé: Jeudi 27 Avril 2023 11:25:54
Objet: Re: [isabelle] Heavy memory use with sledgehammer

BQ_BEGIN
On 26/04/2023 16:28, Ghilain Bergeron wrote:

BQ_BEGIN

I am using NixOS version 22.11, with Isabelle2022. There does not seem to be
any prover left running
when sledgehammer finishes, according to my system monitor.
BQ_END

NixOS is not on the list of officially supported Isabelle platforms, see also
https://isabelle.in.tum.de/installation.html

So it must be a package provided privately by someone else (Who should also
take responsiblitity for it).

Can you provide a source for that? I would be interested so see how it was
done, to say if there is a chance of it working properly.

BQ_BEGIN
However, all I can see is a "poly" process slowly filling up my RAM each time
I call sledgehammer, although
from time to time there seems to be a GC run (not frequently at all; currently
the process takes 13.5GB of my RAM
and I ran sledgehammer like 5 times on a fairly complex goal).
BQ_END

Going towards 16 GB is always critical, because ML GC times are getting long
and can interfere with the interaction. 16 GB is a hard limit of the default
Poly/ML platform.

It might be better to set ML_system_64 = "true" in
$ISABELLE_HOME_USER/etc/preferences (while the PIDE is not running) and try
with a machine that has 64 GB, or so.

Anyway, are these proof problems somewhere publicly available, e.g. on
Isabelle/AFP?

Makarius
BQ_END

view this post on Zulip Email Gateway (Apr 27 2023 at 09:50):

From: Jan van Brügge <jan@vanbruegge.de>

Can you provide a source for that? I would be interested to see how it was done, to say whether there is a chance of it working properly.

I am packaging Isabelle for NixOS. The packaging instructions are here: https://github.com/NixOS/nixpkgs/blob/master/pkgs/applications/science/logic/isabelle/default.nix#L88

To summarize: I take the official Isabelle release, replace the bundled polyml with a version built by nix (but based off the same commit as the Isabelle release), and patch the bundled ATPs so they can find libc and other dependencies.

view this post on Zulip Email Gateway (Apr 27 2023 at 10:12):

From: Makarius <makarius@sketis.net>
External processes are actually managed in Isabelle/Scala: Isabelle/ML makes
an explicit protocol invocation for that, e.g. see
https://isabelle.sketis.net/repos/isabelle-release/file/tip/src/Pure/System/isabelle_system.ML

This already supports a timeout parameter, but historically most Isabelle/ML
applications (including Sledgehammer) use Timeout.apply on the ML side, which
is also subject to GC times.

For heavy computations within Isabelle/ML it is often better to use
Timeout.apply_physical, e.g. the change
https://isabelle.sketis.net/repos/isabelle-release/rev/d54b3c96ee50 --- but
this is probably not relevant for Sledgehammer.
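
For illustration, a minimal sketch contrasting the two ML-side variants (assuming the standard Timeout interface from Pure; the long-running task is just a sleep, and both calls should end with a timeout):

ML ‹
  val slow_task = fn () => OS.Process.sleep (seconds 10.0);

  (*regular ML-side timeout, as used by most tools including Sledgehammer*)
  val result1 =
    (Timeout.apply (seconds 2.0) slow_task (); "finished")
      handle Timeout.TIMEOUT _ => "timed out";

  (*physical timeout, as suggested above for heavy ML computations*)
  val result2 =
    (Timeout.apply_physical (seconds 2.0) slow_task (); "finished")
      handle Timeout.TIMEOUT _ => "timed out";
›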

Below is a minimal example to show the difference of ML vs. Scala timeouts,
even including a protocol trace. For the latter, you need to edit
$ISABELLE_HOME_USER/etc/preferences before starting the Prover IDE as follows:

bash_process_debugging = "true"

Then you start "isabelle jedit" on the terminal and process the subsequent ML
snippets in some theory context as usual:

ML ‹val script = "sleep 10"; val timeout = seconds 2.0;›

text ‹timeout managed in Isabelle/ML:›
ML ‹Timeout.apply timeout Isabelle_System.bash_process (Bash.script script)›

text ‹timeout managed in Isabelle/Scala:›
ML ‹Isabelle_System.bash_process (Bash.script script |> Bash.timeout timeout)›

(*
11:55:34 AM [Isabelle.client] [error] client: start "bash_process" (uuid=66aac7fa-2db8-411a-8fd2-7b2bcefe1e6b, timeout=0.0)
11:55:36 AM [Isabelle.client] [error] client: kill 66aac7fa-2db8-411a-8fd2-7b2bcefe1e6b
11:55:36 AM [Isabelle.client] [error] client: start "bash_process" (uuid=fdcf8608-ca94-46b4-9a6c-3d1416ed2755, timeout=2.0)
11:55:36 AM [Isabelle.bash_process] [error] bash_process: stop "bash_process" (uuid=66aac7fa-2db8-411a-8fd2-7b2bcefe1e6b, return_code=130)
11:55:38 AM [Isabelle.bash_process] [error] bash_process: stop "bash_process" (uuid=fdcf8608-ca94-46b4-9a6c-3d1416ed2755, return_code=142)
11:55:38 AM [Isabelle.client] [error] client: kill fdcf8608-ca94-46b4-9a6c-3d1416ed2755
*)

Return code 130 is 128 + SIGINT (signal 2), originating from the Isabelle/ML
implementation of Timeout.apply.

Return code 142 is 128 + SIGALRM (signal 14), originating from Isabelle/Scala ---
not the external process. See also
https://isabelle.sketis.net/repos/isabelle-release/file/tip/src/Pure/System/bash.scala#l322

Sledgehammer might benefit from moving the timeout from ML into Scala, because
that is closer to the actual process and the "watchdog" is not subject to the
health of the ML runtime system.

It is also interesting to run Sledgehammer examples with the above system
option bash_process_debugging = "true" to see rather mixed return codes.

(One day, Isabelle/PIDE will include a GUI panel for bash processes, but not now.)

Makarius

view this post on Zulip Email Gateway (Apr 27 2023 at 10:22):

From: "Eugene W. Stark" <isabelle-users@starkeffect.com>
[Resend to list only, because my previous send had a From address other than the subscribed one.]

On 4/27/23 05:31, Makarius wrote:

On 26/04/2023 17:30, Eugene W. Stark wrote:

To add to this: it is my experience and belief that, under certain modes of use, Sledgehammer causes
Poly/ML to leak heap space, resulting in the heap utilization eventually bumping up against the limit
and causing frequent garbage collection and poor performance.  At that point, restarting Isabelle is
necessary to restore performance.

As you have plenty of public examples on Isabelle/AFP, can you find a situation where it is somehow reproducible?

We also need a specification of the hardware resources (RAM, CPU cores).

I currently experience this on the following platform:

Intel i9-10850K (10 cores)
128GB RAM
JEDIT_JAVA_OPTIONS="-Xmx16384m"
ML_OPTIONS="--maxheap 64G"
Ubuntu 22.04:
Linux 5.19.0-40-generic #41~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 31 16:00:14 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

There isn't a cookbook recipe for reproducing this. The problem typically occurs after a long session
with heavy use of "try" and/or the Sledgehammer panel.

To my mind, the problem is consistent with a situation where weak references or the like
are created in the ML heap as a result of running the external provers, and then these references are
sometimes lost, resulting in a leak over time. However, I have no specific knowledge of how the
external provers are interfaced with ML. If there are in fact weak references of this
kind, I would think that instrumenting their creation and reclamation could
indicate whether there are leaks involving them.

It might be related to the long-standing issue with some
of the external provers not handling signals correctly and popping up notifications that the external
program has crashed.

What exactly is the signal / return code of the crash?

Maybe this is just an out-of-memory situation in Linux.

No, it is not an out-of-memory situation. We have discussed this issue in this forum some years ago.
I did some investigation at that time and found that some of the provers under certain situations do
not properly handle signals sent to them to cause them to terminate, resulting in the OS kernel flagging
an abnormal termination. Under Ubuntu, at least, this causes a pop-up notification
("A system program has failed", or the like). Frequently after an extended session using Sledgehammer
I have to clear several (sometimes many) of these notifications, which accumulate underneath the
application windows. In at least one case, I did trace the origin of the problem to a signal handler
that was either part of the external prover code or part of some "glue" code that interfaced the Isabelle
system to that external prover -- I don't remember which. I think the signal involved might have been
SIGABRT, but again I don't recall for sure at this time. It was a signal whose default disposition
is to cause a core dump, because otherwise a notification would not be generated by the OS.

If I recall correctly, the result of the previous thread and my own investigation was that
"It is the fault of the external prover implementers" and that my problem report should be directed at
them. Getting involved in that was not consistent with my own goals in using the system, so I did not
pursue it further at that time, but the problem still exists. The reason I mentioned this again in
the current thread is that I speculate that the abnormal termination of an external prover in this way
could be the origin of leaked storage in the ML heap.

- Gene Stark

view this post on Zulip Email Gateway (Apr 27 2023 at 10:26):

From: Makarius <makarius@sketis.net>
Note that "Isabelle itself" is everything that we bundle in the official
release. This also includes OpenJDK and Scala, now even VSCode/Electron/Node.js.

There are reasons for that: over the decades, strange problems have occurred
with arbitrary versions of such system components seen in the wild.

So a proper version of the NixOS package would take everything from the
official Isabelle download, without any censorship.

Isabelle actually shares the approach of NixOS to include just one specific
version of (almost) everything.

Makarius

view this post on Zulip Email Gateway (Apr 27 2023 at 10:33):

From: Makarius <makarius@sketis.net>
That is an interesting high-end configuration.

From a distance, I would say that ML GC times become so long that the overall
model of Sledgehammer timeout no longer fits.

It could be resolved by moving timeouts from ML to Scala, as I have sketched
earlier.

Makarius

view this post on Zulip Email Gateway (Apr 27 2023 at 10:35):

From: Makarius <makarius@sketis.net>
This is very complex. I am not going to study or review it.

It is your responsibility to sort out problems stemming from the disassembly of
all Isabelle components and their reassembly in a slightly different manner.

Makarius

view this post on Zulip Email Gateway (Apr 27 2023 at 10:42):

From: Dominique Unruh <cl-isabelle-users@lists.cam.ac.uk>
Hi,

I have also experienced over the years (i.e., in multiple versions) that
Isabelle tends to slow down a lot after heavy use: lots of sledgehammers,
interrupted and restarted sledgehammers due to edits further up in the code,
and rerunning proofs (again, because of editing fragments higher up than the
bottom of the visible part of the document).

The problem is that this is not something that can be easily reproduced.
The theory file we have at that point is not to blame (restarting
Isabelle fixes the problem on the same file). It is more the interaction
that somehow leads to a problem.

I have the following observations:

* It's not jEdit that's stuck (because unloading the plugin and
  reloading it tends to resolve the problem).

* There seems to be frequent GC once Isabelle is in this state.

* It does not seem to be due to the load of the running sledgehammers
  itself (because sometimes I have waited a long time to give it a
  chance to finish with whatever it is doing, to no avail).

* Sometimes, it's not just a slowdown but Isabelle completely stops
  doing anything (one can edit, but colors/output tab are not
  updated). However, I don't know if it's the same issue.

Since it is almost impossible to produce something reproducible here
except something like "work for a few hours and intensively edit and run
sledgehammer or try after every goal", I think what we need is the
following:

Is there some way to produce some post-mortem analysis? I.e., next time
I observe this, would there be something I can do to get a core-dump or
thread-dump or whatever that would be helpful (for Makarius, I assume)
to diagnose what is happening or at least what further inquiries need to
be made?

Best wishes,
Dominique.

view this post on Zulip Email Gateway (Apr 27 2023 at 10:55):

From: "Eugene W. Stark" <isabelle-users@starkeffect.com>
The quoted message mentions some things that I have observed that I consider to be (potentially)
separate issues from the main one that this thread has been addressing. These are:

(1) An instability that occurs in which Isabelle queues up tens or hundreds of thousands
of "Future Tasks" (as indicated by the monitor panel), which do not get cancelled
even after the "Cancel" button on the Sledgehammer panel has been pressed or the
"try" has been deleted from the JEdit editing buffer. This situation can take
many minutes (or longer) to resolve itself. This behavior is also long-standing,
though it does not occur as frequently as it did a few years ago (from my point of
view).

(2) Extreme time taken by Poly/ML for an apparent major GC. "ML Cleanup" shows in the
JEdit window and although JEdit remains responsive, nothing gets updated that
requires output from ML. This situation can persist for many minutes or longer.
Sometimes I have gone away for perhaps an hour and come back to find it still
in that state, whereupon I give up and just restart Isabelle.

It is possible, I suppose, that this is "expected" behavior; however, in some
cases it has more the appearance of a deadlock situation.

- Gene Stark

view this post on Zulip Email Gateway (Apr 27 2023 at 12:25):

From: Makarius <makarius@sketis.net>
I would say it is better to do live monitoring with more details: ML tasks, ML
threads, bash processes shown in the Prover IDE as some kind of "task manager".

Overall this is not a single program, but a parallel system consisting of many
different components.

Makarius

view this post on Zulip Email Gateway (Apr 27 2023 at 12:27):

From: Dominique Unruh <cl-isabelle-users@lists.cam.ac.uk>
Hi,
Of course. But I cannot interpret those things myself, so the question
is: which of these should I send to the mailing list in case of a partly
stuck Isabelle?

Best wishes,
Dominique.

view this post on Zulip Email Gateway (Apr 27 2023 at 13:07):

From: Makarius <makarius@sketis.net>
A text snapshot of such a live monitor state.

Makarius

view this post on Zulip Email Gateway (Apr 30 2023 at 10:13):

From: Lex Bailey <cl-isabelle-users@lists.cam.ac.uk>
I have experienced the same on an M1 Mac; additionally, I have found that
other applications are affected.
Music stutters when using sledgehammer, and eventually the macOS OOM killer
takes action (although it normally kills other programs before Isabelle,
curiously).

view this post on Zulip Email Gateway (May 02 2023 at 00:28):

From: "Eugene W. Stark" <isabelle-users@starkeffect.com>
Since doing live monitoring was suggested, I've attached a screenshot showing an example of the
"ML Cleanup lockup" situation. I went away and was eating dinner and watching television
(on a different computer) and I came back after about an hour to see this. There are 10 cores
running full tilt, according to "top", but the monitor panel does not seem to show anything
happening. The JEdit window shows "ML Cleanup". ML heap size is near the maximum of 64G
that I have set.

There previously was a "try" in the JEdit editor panel, and you can see the vestigial
"No proof found" in the Output panel. I had deleted the "try", but this does not recover
the session. JEdit remains responsive, but gets no updates from ML.

- Gene Stark

Screenshot from 2023-05-01 20-23-22.png

