From: Florian Haftmann <florian.haftmann@cit.tum.de>
There is a reproducible breakdown in the build system:
https://build.proof.cit.tum.de/build?id=038ecd8c-6a3e-4e55-bacd-a455cf7f6b40
A few weeks ago there was a similar breakdown.
What is going on here?
Thanks a lot,
Florian
From: Makarius <makarius@sketis.net>
On 30/12/2025 20:49, Florian Haftmann wrote:
There is a reproducible breakdown in the build system:
https://build.proof.cit.tum.de/build?id=038ecd8c-6a3e-4e55-bacd-a455cf7f6b40
Thanks for the hint. I have now reset the database state of the build server.
A few weeks ago there was a similar breakdown.
What is going on here?
There is some fragility in the design and/or implementation of the cluster.
Rather soon, I need to sit down with Fabian Huch, to sort it out.
Makarius
From: Tobias Nipkow <nipkow@in.tum.de>
Unfortunately the build system has hit the same problem again.
Tobias
On 01/01/2026 22:46, Makarius wrote:
On 30/12/2025 20:49, Florian Haftmann wrote:
There is a reproducible breakdown in the build system:
https://build.proof.cit.tum.de/build?id=038ecd8c-6a3e-4e55-bacd-a455cf7f6b40
Thanks for the hint. I have now reset the database state of the build server.
A few weeks ago there was a similar breakdown.
What is going on here?
There is some fragility in the design and/or implementation of the cluster.
Rather soon, I need to sit down with Fabian Huch, to sort it out.
Makarius
From: Makarius <makarius@sketis.net>
On 02/01/2026 09:15, Tobias Nipkow wrote:
Unfortunately the build system has hit the same problem again.
I have reset everything once more, and tried "isabelle build_task -a"
successfully.
Hopefully this is sufficient for the rest of the Christmas vacation. I will be
mostly unavailable at least until 07-Jan-2026.
Makarius
From: Fabian Huch <huch@in.tum.de>
There are two problems here:
- the underlying problem is that the Isabelle/Scala process uses up a lot more
  memory than it used to, causing OOM errors;
- the build infrastructure cannot handle build processes that have abnormally
  terminated.
To address this, we can:
(1) tune memory settings and investigate why we're using so much more JVM
memory than before -- I'll look into that first.
(2) Make the system more robust in the face of such problems. I need to
sit down with Makarius to discuss this.
(3) Make the scheduling aware of memory limitations. The system was
designed such that this is possible, but it requires more work. I'll
work on that once the immediate problems are solved.
Fabian
On 1/2/26 22:08, Makarius wrote:
On 02/01/2026 09:15, Tobias Nipkow wrote:
Unfortunately the build system has hit the same problem again.
I have reset everything once more, and tried "isabelle build_task -a"
successfully.
Hopefully this is sufficient for the rest of the Christmas vacation. I will be
mostly unavailable at least until 07-Jan-2026.
Makarius
From: Fabian Huch <huch@in.tum.de>
On 1/6/26 12:46, Fabian Huch wrote:
There are two problems here:
- the underlying problem is that the Isabelle/Scala process uses up a lot more
  memory than it used to, causing OOM errors;
- the build infrastructure cannot handle build processes that have abnormally
  terminated.
To address this, we can:
(1) tune memory settings and investigate why we're using so much more JVM
memory than before -- I'll look into that first.
I checked and tuned the system:
- we had a rogue Isabelle process on one of the nodes that refused to
  terminate, eating up ~30GiB of that machine's RAM;
- I reduced the maximal number of concurrent jobs on our lower-memory
  machines (which have 64GiB each).
This worked fine for the past week; however, the memory consumption of
the JVM process is still abnormally high.
Looking at memory snapshots of the build, one of the reasons is the
recent incorporation of Node_Status into the progress: command_timings
are stored as
  command_timings: Map[Command, Command_Timings]
and the command (in the key) contains large markups, e.g. in
init_markups, during the lifetime of these timings: up to ~1GiB during
the build of the distribution, likely significantly more in an AFP
build. I would propose to only keep the Document_ID.Command here, see
the attached patch.
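As a rough illustration of the retention pattern (hypothetical stand-in types,
not Isabelle's actual Command or Command_Timings): keying the map by the full
command keeps its markups reachable for the lifetime of the map, while keying
by the plain id does not.

  // hypothetical types, only to illustrate why the heavy key retains memory
  object Retention_Sketch {
    final case class Heavy_Command(id: Long, init_markups: String)  // stands in for Command
    final case class Command_Timing(elapsed: Double)

    // retains every Heavy_Command (and its markups) as long as the map lives:
    var by_command: Map[Heavy_Command, Command_Timing] = Map.empty

    // retains only the Long id (cf. Document_ID.Command); the command itself can be collected:
    var by_id: Map[Long, Command_Timing] = Map.empty

    def record(command: Heavy_Command, timing: Command_Timing): Unit = {
      // by_command += (command -> timing)   // old shape: keeps command.init_markups alive
      by_id += (command.id -> timing)        // proposed shape: keeps only the id
    }
  }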
Fabian
From: Makarius <makarius@sketis.net>
On 13/01/2026 11:26, Fabian Huch wrote:
Looking at memory snapshots of the build, one of the reasons is the recent
incorporation of Node_Status into the progress: command_timings are stored as
  command_timings: Map[Command, Command_Timings]
and the command (in the key) contains large markups, e.g. in init_markups,
during the lifetime of these timings: up to ~1GiB during the build of the
distribution, likely significantly more in an AFP build.
Thanks for the measurements. I've already suspected a problem exactly there,
but did not know how to do memory profiling on the JVM.
Can you give some hints on how to do that?
Makarius
From: Fabian Huch <huch@in.tum.de>
On 1/13/26 13:54, Makarius wrote:
On 13/01/2026 11:26, Fabian Huch wrote:
Looking at memory snapshots of the build, one of the reasons is the
recent incorporation of Node_Status into the progress:
command_timings are stored as
  command_timings: Map[Command, Command_Timings]
and the command (in the key) contains large markups, e.g. in
init_markups, during the lifetime of these timings: up to ~1GiB
during the build of the distribution, likely significantly more in an
AFP build.
Thanks for the measurements. I've already suspected a problem exactly
there, but did not know how to do memory profiling on the JVM.
Can you give some hints on how to do that?
I usually do this by analyzing multiple heap dumps of the JVM during the
run. Those can be generated either from within the JVM, like so:
  // dump the JVM heap to the given file; include_live = true restricts the dump to live objects
  def dump_heap(file: Path, include_live: Boolean = true): Unit = {
    val mxBean = java.lang.management.ManagementFactory.getPlatformMXBean(
      classOf[com.sun.management.HotSpotDiagnosticMXBean])
    mxBean.dumpHeap(file.absolute.implode, include_live)
  }
or by instructing the JVM from the outside, e.g. via
  'jmap -dump:format=b,file=heap.hprof <pid>'
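For instance, a hypothetical invocation of the in-JVM helper above (using
Isabelle/Scala's Path):
  dump_heap(Path.explode("heap.hprof"))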
These heap dumps can then be analyzed with tools such as Eclipse MAT or
JProfiler, in particular by:
- looking at histograms of object classes and their shallow heap size/retained
  heap size, e.g. here you see that command markups retain ~1GiB of heap:
- analyzing the memory graph through tree views, e.g. here you can see why the
  command markups are retained (there is a delayed event to update the
  progress, where the largest Command key has ~16MiB init_markups):
However, I couldn't get MAT to ignore the weak references of our XML
cache (displayed as the object with the largest retained heap); this makes it
far less useful here.
Perhaps this is possible by tinkering with the options, but I don't know
how -- I used to work with JProfiler, but that is commercial.
Fabian
[attached screenshots: xVNQjnDCttesdZ9N.png, IWTyC0NJNa2j0viE.png]
From: Makarius <makarius@sketis.net>
On 13/01/2026 11:26, Fabian Huch wrote:
Looking at memory snapshots of the build, one of the reasons is the recent
incorporation of Node_Status into the progress: command_timings are stored as
  command_timings: Map[Command, Command_Timings]
and the command (in the key) contains large markups, e.g. in init_markups,
during the lifetime of these timings: up to ~1GiB during the build of the
distribution, likely significantly more in an AFP build. I would propose to
only keep the Document_ID.Command here, see the attached patch.
See now:
changeset: 83825:3ed2781b1cc5
user: wenzelm
date: Wed Jan 14 13:09:47 2026 +0100
files: src/Pure/PIDE/document_status.scala
src/Tools/jEdit/src/timing_dockable.scala
description:
more frugal persistent data, notably for batch builds --- proposed by Fabian
Huch, after Java heap measurements (see also 3f6280aedbcc, a110e7e24e55,
9cc5d77d505c);
It is your original change, with my explanation from the history. I've added a
few more changes on top, but they don't make a fundamental difference.
Note that I did not repeat any performance measurements. I want to do this
eventually, also to see if other problems are present.
Makarius
From: Makarius <makarius@sketis.net>
On 14/01/2026 21:22, Makarius wrote:
See now:
changeset: 83825:3ed2781b1cc5
user: wenzelm
date: Wed Jan 14 13:09:47 2026 +0100
files: src/Pure/PIDE/document_status.scala src/Tools/jEdit/src/timing_dockable.scala
description:
more frugal persistent data, notably for batch builds --- proposed by Fabian
Huch, after Java heap measurements (see also 3f6280aedbcc, a110e7e24e55,
9cc5d77d505c);
It is your original change, with my explanation from the history. I've added a
few more changes on top, but they don't make a fundamental difference.
Note that I did not repeat any performance measurements. I want to do this
eventually, also to see if other problems are present.
I have now done some measurements with "isabelle build -o threads=8 -j2 -a -f"
on my Linux box (16 cores), using the included patch for the Isabelle/Scala
implementation of "isabelle build".
The Scala/Java process uses these settings:
  ISABELLE_TOOL_JAVA_OPTIONS="-Djava.awt.headless=true -Xms512m -Xmx4g -Xss16m -XX:+UseZGC -XX:+ZGenerational -XX:SoftMaxHeapSize=2g"
Attached is raw data from the parent of 3ed2781b1cc5 (1), and from
3ed2781b1cc5 (2). The diagram.png is from gnuplot:
  plot '1.dat' using 1:2 smooth sbezier title "1 heap_size" noenhanced, \
       '1.dat' using 1:3 smooth sbezier title "1 heap_used" noenhanced, \
       '2.dat' using 1:2 smooth sbezier title "2 heap_size" noenhanced, \
       '2.dat' using 1:3 smooth sbezier title "2 heap_used" noenhanced
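For reference, the data columns appear to be elapsed time, heap size and heap
used. A minimal sketch of how such samples could be collected with the standard
MemoryMXBean -- this is only an assumption about the shape of the measurement
patch, not the attached patch itself:

  import java.lang.management.ManagementFactory

  // emit one line per interval: elapsed seconds, committed heap (bytes), used heap (bytes)
  def sample_heap(interval_ms: Long = 1000): Thread = {
    val start = System.currentTimeMillis()
    val thread = new Thread(() => {
      val memory = ManagementFactory.getMemoryMXBean
      while (true) {
        val usage = memory.getHeapMemoryUsage
        val t = (System.currentTimeMillis() - start) / 1000.0
        println(s"$t ${usage.getCommitted} ${usage.getUsed}")
        Thread.sleep(interval_ms)
      }
    })
    thread.setDaemon(true)
    thread.start()
    thread
  }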
Conclusion: There is nothing interesting to be seen here. The change hardly
makes a difference.
It could mean that occurrences of Document_ID.Command and Command are in
1-to-1 correspondence on the heap, i.e. no stored Command is later updated/copied.
Or it could mean that I've got something wrong in the experimental setup.
Makarius
ch
1.dat.xz
2.dat.xz
diagram.png
From: Fabian Huch <huch@in.tum.de>
On 1/17/26 23:34, Makarius wrote:
On 14/01/2026 21:22, Makarius wrote:
See now:
changeset: 83825:3ed2781b1cc5
user: wenzelm
date: Wed Jan 14 13:09:47 2026 +0100
files: src/Pure/PIDE/document_status.scala src/Tools/jEdit/src/timing_dockable.scala
description:
more frugal persistent data, notably for batch builds --- proposed by
Fabian Huch, after Java heap measurements (see also 3f6280aedbcc,
a110e7e24e55, 9cc5d77d505c);
It is your original change, with my explanation from the history. I've added
a few more changes on top, but they don't make a fundamental difference.
Note that I did not repeat any performance measurements. I want to do this
eventually, also to see if other problems are present.
I have now done some measurements with "isabelle build -o threads=8 -j2 -a -f"
on my Linux box (16 cores), using the included patch for the Isabelle/Scala
implementation of "isabelle build".
The Scala/Java process uses these settings:
  ISABELLE_TOOL_JAVA_OPTIONS="-Djava.awt.headless=true -Xms512m -Xmx4g -Xss16m -XX:+UseZGC -XX:+ZGenerational -XX:SoftMaxHeapSize=2g"
Attached is raw data from the parent of 3ed2781b1cc5 (1), and from
3ed2781b1cc5 (2). The diagram.png is from gnuplot:
  plot '1.dat' using 1:2 smooth sbezier title "1 heap_size" noenhanced, \
       '1.dat' using 1:3 smooth sbezier title "1 heap_used" noenhanced, \
       '2.dat' using 1:2 smooth sbezier title "2 heap_size" noenhanced, \
       '2.dat' using 1:3 smooth sbezier title "2 heap_used" noenhanced
Conclusion: There is nothing interesting to be seen here. The change hardly
makes a difference.
It could mean that occurrences of Document_ID.Command and Command are in
1-to-1 correspondence on the heap, i.e. no stored Command is later
updated/copied.
Hm. I had measured a small number of samples (since heap snapshots are
expensive and large) before and after the change and looked at the heap
composition. But it is possible that the (consistently) lower heap size
was coincidental, and the extended lifetime of these Command markups did
not actually matter -- I'll have to figure out how to properly deal with
our opaque cache in MAT.
Or it could mean that I've got something wrong in the experimental setup.
One problem with this experimental setup is that the current heap size is
not a good metric for such a high-throughput application (most of the memory
counted as used is actually dead objects awaiting collection).
Also, the smoothing hides what is going on, e.g. when the heap limit is
hit (cf. plot for 1 below). I would rather use the MXBeans of the
garbage collector to obtain GC events and plot a proper peak-to-peak curve.
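A rough sketch of that idea (not existing Isabelle code; it assumes HotSpot's
com.sun.management notification API): register a listener on each collector
bean and log the used heap after every GC, which yields the post-collection
values for a peak-to-peak plot.

  import java.lang.management.ManagementFactory
  import javax.management.{Notification, NotificationEmitter, NotificationListener}
  import javax.management.openmbean.CompositeData
  import com.sun.management.GarbageCollectionNotificationInfo
  import scala.jdk.CollectionConverters._

  // register a listener on every garbage collector and print the used heap after each GC
  def listen_gc_events(): Unit =
    for (gc <- ManagementFactory.getGarbageCollectorMXBeans.asScala) {
      gc.asInstanceOf[NotificationEmitter].addNotificationListener(
        new NotificationListener {
          override def handleNotification(n: Notification, handback: Object): Unit =
            if (n.getType == GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION) {
              val info =
                GarbageCollectionNotificationInfo.from(n.getUserData.asInstanceOf[CompositeData])
              val used_after =
                info.getGcInfo.getMemoryUsageAfterGc.asScala.values.map(_.getUsed).sum
              println(info.getGcAction + ": used after GC = " + (used_after >> 20) + " MiB")
            }
        }, null, null)
    }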
Fabian