Stream: Archive Mirror: Isabelle Users Mailing List

Topic: [isabelle] ML goes "incommunicado" for minutes at a time ...


view this post on Zulip Email Gateway (Aug 22 2022 at 13:11):

From: "Eugene W. Stark" <isabelle-users@starkeffect.com>
I have posted (and seen postings) about this general type of thing here before,
but I am getting a more refined view of what seems to me to be a problem and want
to ask again and provide some data points.

I am running Isabelle2016 under Ubuntu 14.04 LTS, using the 64-bit version on
two similar machines, both with an Intel(R) Core(TM) i5-4670K CPU @ 3.40GHz
(4 cores). One machine has 24GB RAM, on which I have limited the ML heap to
16GB to avoid starving other activities of memory; the other has 16GB RAM, on
which I have limited the ML heap to 8GB for the same reason. The problem I am
describing occurs on both machines, but for definiteness, the screenshots I
took today were taken on the 16GB machine.
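
(For concreteness, one way to impose such a cap is via ML_OPTIONS in
$ISABELLE_HOME_USER/etc/settings. The sketch below assumes a Poly/ML runtime
that understands the --maxheap option with a size suffix; the exact flag may
differ between Poly/ML versions, so this is only an illustration.)

    # $ISABELLE_HOME_USER/etc/settings
    # Cap the Poly/ML heap; choose a value below the machine's physical RAM.
    ML_OPTIONS="--maxheap 16G"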

The problem is that, during sledgehammering using "try" it seems that ML will
go "incommunicado" for minutes at a time, and it is hard to figure out why.
During this period the CPU usage ranges anywhere from 100% to 400%, where 100%
means one core. Based on previous postings, I would normally assume that it is
due to a major GC taking place, but also during these periods, the "Monitor"
window in jEdit does not receive any data updates from ML. Once ML has come
back online, the Monitor window starts receiving data again, but it bridges the
gap with a straight line across the missing interval. This makes it difficult to understand exactly what
is going on. I don't really know whether there are any conditions under which
it is supposed to be the case that ML will stop sending data points to the
Monitor window, but it seems odd to me. Of course, the long pause for an
unknown period is also frustrating, and in many cases I just terminate and
restart Isabelle, substituting the known wait for my theories to be reloaded and
rechecked for the unknown wait for ML to return.

For what it's worth, Isabelle's memory use is limited, the limits are being
honored, paging traffic is low, and CPU utilization is high, so this is not a
thrashing situation.

I am attaching some screenshots that I took today when I let the sledgehammering
run for a while through multiple cycles of this behavior.
Screenshot from 2016-04-21 10:20:01.png
Screenshot from 2016-04-21 10:20:16.png
Screenshot from 2016-04-21 10:20:27.png
Screenshot from 2016-04-21 10:20:37.png
Screenshot from 2016-04-21 10:20:48.png

view this post on Zulip Email Gateway (Aug 22 2022 at 13:11):

From: Makarius <makarius@sketis.net>
On Thu, 21 Apr 2016, Eugene W. Stark wrote:

The problem is that, during sledgehammering using "try" it seems that ML
will go "incommunicado" for minutes at a time, and it is hard to figure
out why.

On the snapshot from the "Heap" window, you can see spikes in memory
usage, leading to what appears to be a major GC. Upon completion of the
major GC, the "freeze" sets in, and no data is sent to the monitor for
several minutes. When the data resumes, the heap size is much lower, but
quickly spikes up to the max and the cycle repeats. Because there is no
data during the freeze period, it is difficult to tell what is going on,
but I have to wonder what it is doing during those 3+ minutes when there
is only a short spike of computing activity that takes place before it
starts again.

Could anyone comment on whether this behavior is considered "normal"?

It sounds indeed like a normal consequence of a situation where the ML
heap is under high pressure: tons of data produced and the GC + heap
compaction phases trying to reclaim memory.

When the Poly/ML runtime system performs such GC-related operations, all
ML threads are stopped. The Monitor data is produced by such a regular ML
thread, so nothing will be reported during that time.

I am suspicious of some kind of thread signalling/synchronization issue,
but not knowing anything about the actual implementation, can only
speculate.

From the description of the problem, it does not really look like that,
although there is always the possibility of something really strange going on.

In conclusion, my guess from a distance is that "sledgehammer" or "try"
produce more data than expected, potentially due to unusually big
theories. You could also try to reset the machine-learning information in
$ISABELLE_HOME_USER/mash_state.
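
For example, from a shell, moving rather than deleting the file so the learned
data can be restored later (by default $ISABELLE_HOME_USER expands to
~/.isabelle/<release> for the respective Isabelle version):

    mv "$ISABELLE_HOME_USER/mash_state" "$ISABELLE_HOME_USER/mash_state.bak"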

Since you are using 64-bit Poly/ML with substantial amounts of heap space, the
question is what makes the theories so big that this is required. Even the
biggest applications in the AFP still fit into the small memory model of
32-bit Poly/ML with a 3.5 GB heap. There is a penalty of approximately a
factor of 2 in memory requirements when switching from 32-bit to 64-bit.

Makarius

view this post on Zulip Email Gateway (Aug 22 2022 at 13:13):

From: Mathias Fleury <mathias.fleury12@gmail.com>
Hello all,

I have seen the described behaviour too. So far, I know that:
- it is linked to MaSh (i.e., MePo does not exhibit this slow-down); you can
  deactivate the learning in the Sledgehammer options (via the Isabelle options
  in Isabelle/jEdit or via sledgehammer_params, as sketched below);
- removing the $ISABELLE_HOME_USER/mash_state file helps.
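
A minimal sketch of the sledgehammer_params route, assuming the options are
still called fact_filter and learn in your release (the Sledgehammer manual
has the authoritative names):

    (* in a theory file: force MePo relevance filtering and switch off MaSh learning *)
    sledgehammer_params [fact_filter = mepo, learn = false]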

I have not yet found a deterministic way to reproduce it.

Mathias

