Stream: Mirror: Isabelle Users Mailing List

Topic: [isabelle] Copilot (or other programming assistants) for ...


view this post on Zulip Email Gateway (Dec 08 2025 at 20:04):

From: "Bisping, Benjamin" <cl-isabelle-users@lists.cam.ac.uk>

Hello everybody,

This week, I plan to convince a computer science audience that interactive theorem proving is the future of mathematics and verification. (I'm sure many on this mailing list know this situation.) I intend to build some arguments around Terence Tao getting tens of thousands of people to watch him use Lean+Copilot this year <https://www.youtube.com/watch?v=cyyR7j2ChCI>. As I will employ Isabelle/HOL for the demonstration, I strongly anticipate questions about how to hook it up with GitHub Copilot.

I tried it myself and Isabelle+Copilot seems highly non-trivial due to the incompatibilities introduced by Microsoft on one side and the Isabelle/Codium package on the other side.* This is a pity because, in principle, this year's LLM coding assistants appear to finally be quite decent at finding Isabelle proofs. If I open Isabelle/HOL theories in VS Code, GPT-5.1-Codex can solve tricky stuff, for instance, find fitting induction invariants.

But of course, the workflow to co-edit theories between VSCode (or ChatGPT prompts) and JEdit is too clunky to be proposed in my “sales pitch” for ITPs. And if there's no easy automation for writing proofs in Isabelle, but there is in Lean, I already know how 100% of the listeners will decide which of the two systems to look into.

I wonder whether people here have positive experiences / suggestions on LLM-supported proof generation in Isabelle that I could report to advertise it?**

Kind regards,

Ben

* Prior 2023 hacky solution for Copilot in Isabelle's VS Codium: https://isabelle.systems/zulip-archive/stream/202961-General/topic/Isabelle.2FVSCode.20Github.20Copilot.html . The 2025 message on the topic to this mailing list went unanswered: https://lists.cam.ac.uk/sympa/arc/cl-isabelle-users/2025-05/msg00002.html
** Prior discussion about the discontinued Isabelle plugin for VS Code, which would otherwise at least be some answer: https://lists.cam.ac.uk/sympa/arc/cl-isabelle-users/2023-10/msg00005.html

view this post on Zulip Email Gateway (Dec 08 2025 at 20:07):

From: Makarius <makarius@sketis.net>

On 08/12/2025 21:04, "Bisping, Benjamin" (via cl-isabelle-users Mailing List)
wrote:

But of course, the workflow to co-edit theories between VSCode (or ChatGPT
prompts) and JEdit is too clunky to be proposed in my “sales pitch” for ITPs.
And if there's no easy automation for writing proofs in Isabelle, but there is
in Lean, I already know how 100 % of the listeners will decide which of the
two systems to look into.

Oh dear, better use Lean then.

Side-remark: The tool of choice for proof automation is called "Sledgehammer".

Makarius

view this post on Zulip Email Gateway (Dec 08 2025 at 20:12):

From: Dominic Mulligan <cl-isabelle-users@lists.cam.ac.uk>

Dear Benjamin,

You may find the I/Q plugin interesting. It exposes Isabelle as an MCP server to LLMs and allows the agent to interactively edit files, query proof states, and so on.

The code is available here: https://github.com/awslabs/autocorrode

It’s mainly been tested with Amazon’s Kiro CLI agent (formerly called Q CLI), but I believe at least one other person has used it in “anger” with another agent.
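For orientation, many MCP clients are pointed at a server through a small JSON stanza in the agent's configuration. The entry below is a hedged sketch of that general shape only: the server name, command path, and arguments are hypothetical placeholders, so consult the autocorrode README for the actual invocation.

```json
{
  "mcpServers": {
    "isabelle": {
      "command": "/path/to/iq-server",
      "args": ["--isabelle-home", "/path/to/Isabelle2025"]
    }
  }
}
```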

Thanks,
Dominic


view this post on Zulip Email Gateway (Dec 08 2025 at 20:16):

From: Wolfram Kahl <kahl@cas.mcmaster.ca>

On Mon, Dec 08, 2025 at 09:07:26PM +0100, Makarius wrote:

Side-remark: The tool of choice for proof automation is called
"Sledgehammer".

Does that SOUND elegant and/or sophisticated to you?

Perhaps a marketing problem... ;-)

Wolfram

view this post on Zulip Email Gateway (Dec 08 2025 at 20:19):

From: Makarius <makarius@sketis.net>

On 08/12/2025 21:16, Wolfram Kahl wrote:

On Mon, Dec 08, 2025 at 09:07:26PM +0100, Makarius wrote:

Side-remark: The tool of choice for proof automation is called
"Sledgehammer".

Does that SOUND elegant and/or sophisticated to you?

Yes. It has become a well-established name in the ITP community. There are
also many other "hammer" projects, though none as successful.

Perhaps a marketing problem... ;-)

Who cares about marketing? I don't.

Makarius

view this post on Zulip Email Gateway (Dec 08 2025 at 20:21):

From: Yakoub Nemouchi <y.nemouchi@gmail.com>

There is a DARPA program called PROVERS where I know some people are trying
to come up with such a feature for Isabelle. You may want to reach out to
performers of DARPA PROVERS. I believe that the Stanford University team
is cooking something for Isabelle.

That said, computer scientists have to take a minute and remember that
undecidable problems, such as finding proofs in HOL, finding invariants, or
even writing a complete specification, cannot be fully automated. This is
what we learned in the 3rd semester of CS class, specifically in theory of
computation, more specifically in computability theory.

The best solutions found so far are ITPs themselves, where humans in the loop
create computer-assisted abstractions, combined with symbolic AI tools (i.e.,
automated reasoning, for example via natural deduction) to increase
automation.

LLMs are the biggest imposters I have ever seen. They cannot be trusted even
to handle deterministic and repetitive processes, let alone to do reasoning.

My 2 cents.

Yakoub.


view this post on Zulip Email Gateway (Dec 08 2025 at 20:42):

From: Lawrence Paulson <lp15@cam.ac.uk>

It is quite hard to beat the simple combination of the structured Isar language and sledgehammer. First, just try sledgehammer. If it fails, think of an intermediate formula and try using sledgehammer to close both gaps. If not, continue with more intermediate formulas. Many straightforward proofs can be found that way.
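A minimal sketch of that pattern in Isar (lemma invented for illustration; the `by simp` one-liners stand in for whatever call sledgehammer actually suggests, typically `metis` or `smt`):

```
lemma "length (rev (xs @ ys)) = length xs + length ys"
proof -
  (* intermediate formula; its gap is closed by a sledgehammer-found one-liner *)
  have "length (rev (xs @ ys)) = length (xs @ ys)" by simp
  (* second gap, again closed automatically *)
  then show ?thesis by simp
qed
```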

Larry
On 8 Dec 2025 at 20:21 +0000, Yakoub Nemouchi <y.nemouchi@gmail.com>, wrote:

The best solutions found so far are ITPs themselves, where humans in the loop create computer-assisted abstractions, combined with symbolic AI tools (i.e., automated reasoning, for example via natural deduction) to increase automation.

view this post on Zulip Email Gateway (Dec 08 2025 at 21:12):

From: Yakoub Nemouchi <y.nemouchi@gmail.com>

… and because the word "AI" is the trend these days: Sledgehammer is your
symbolic AI tool, and humans bring their neurons. So the interaction between
a human and an ITP such as Isabelle gives you the most powerful
neuro-symbolic AI tool.


view this post on Zulip Email Gateway (Dec 08 2025 at 21:17):

From: Kevin Kappelmann <cl-isabelle-users@lists.cam.ac.uk>

Hi Ben,

You can use Isabelle/VSCode (i.e. use isabelle vscode instead of isabelle jedit) and then install any extension you like from the open-vsx marketplace [1], e.g. Continue [2], which can be connected to GPT-5.1 (I believe).
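For the record, the first step looks like this in a terminal (theory name invented for illustration; assumes the Isabelle bin directory is on your PATH):

```
$ isabelle vscode My_Theory.thy
```

This launches Isabelle's bundled VSCodium-based editor, which fetches extensions from Open VSX rather than Microsoft's marketplace.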

Best wishes,

Kevin

[1] https://open-vsx.org/
[2] https://open-vsx.org/extension/Continue/continue


view this post on Zulip Email Gateway (Dec 08 2025 at 22:11):

From: Changyang Zeng <cl-isabelle-users@lists.cam.ac.uk>

Glad to see that some of us are building agents with Isabelle. I am just
wondering: what kinds of software or hardware do people tend to verify with
Isabelle?

Thanks,
Changyang

On Mon, Dec 8, 2025 at 2:05 PM Mete Polat <metepolat2000@gmail.com> wrote:

Currently building an agentic IDE at omergence.dev with Isabelle in the
backend and GitHub Copilot built in. Hit me up if you have any questions or
are interested.


view this post on Zulip Email Gateway (Dec 08 2025 at 22:28):

From: Mete Polat <metepolat2000@gmail.com>

Currently building an agentic IDE at omergence.dev with Isabelle in the
backend and GitHub Copilot built in. Hit me up if you have any questions or
are interested.


view this post on Zulip Email Gateway (Dec 09 2025 at 17:16):

From: Jonathan Julián Huerta y Munive <jonjulian23@gmail.com>

Hi Benjamin,

You might want to try the resources in the DeepIsaHOL project
<https://github.com/yonoteam/DeepIsaHOL>.
Currently it has support for ChatGPT, Gemini, Ollama, and HuggingFace
Transformers.
It is very much in a prototypical state and will be upgraded in the future
to adhere to Isabelle standards. However, for the demonstration purposes
you mention, it should suffice. For instance, here is a video
<https://youtu.be/Uq7I5sGaHo0> showing ChatGPT proving simple theorems in
Isabelle. The result is still interactive theorem proving, but the
DeepIsaHOL project has support
<https://ieeexplore.ieee.org/document/11024329> for automating the manual
parts at the end of the video.

Best wishes,
Jonathan



--

Jonathan Julian Huerta y Munive

Researcher on formal methods and automated reasoning

view this post on Zulip Email Gateway (Dec 10 2025 at 03:02):

From: Yakoub Nemouchi <y.nemouchi@gmail.com>

It would be great if someone holding a position in academia did a study
specifically on ITPs: comparing the amount of energy needed for Sledgehammer
to generate a proof with the amount of energy consumed to train GPT agents
and prompt them to generate the same proof, and studying the impact of the
energy consumption from prompting and training GPT agents on societies,
communities, and the environment.

Best wishes,

Yakoub.

view this post on Zulip Email Gateway (Dec 10 2025 at 08:47):

From: Lawrence Paulson <lp15@cam.ac.uk>

Sledgehammer runs easily on a laptop and its energy consumption is essentially zero compared with the amount of energy consumed to train GPT agents.

Larry
On 10 Dec 2025 at 03:01 +0000, Yakoub Nemouchi <y.nemouchi@gmail.com>, wrote:
It would be great if someone holding a position in academia did a study specifically on ITPs: comparing the amount of energy needed for Sledgehammer to generate a proof with the amount of energy consumed to train GPT agents and prompt them to generate the same proof, and studying the impact of the energy consumption from prompting and training GPT agents on societies, communities, and the environment.


view this post on Zulip Email Gateway (Dec 10 2025 at 14:08):

From: Yakoub Nemouchi <y.nemouchi@gmail.com>

That was exactly my point, Larry.

Someone needs to shine a light on these GPT agents, with data as evidence.
Sledgehammer used to give me an Isabelle proof with less energy and less time
for quite complicated statements. Most of the time, when the statement is
complicated, the GPT agents will consume more energy and more time without
even generating a proof. And because GPT agents are good liars, they still
generate something and call it “a profile of a proof”.

Not only that: when I prompt them to translate a spec written in natural
language to Isabelle/HOL, even if they generate something that works, it
always looks clunky, not generic, and could be better here and there. In
effect, it adds an unnecessary layer of distraction and time-wasting for
users.

Maybe those GPT agents are good for certain tasks, but definitely not for
programming, let alone reasoning.


view this post on Zulip Email Gateway (Dec 10 2025 at 17:08):

From: Lawrence Paulson <lp15@cam.ac.uk>

Useful to know. I find them good for finding things in our libraries (both lemmas and obscure syntax). They also generate pretend proofs.

Larry

On 10 Dec 2025, at 14:07, Yakoub Nemouchi <y.nemouchi@gmail.com> wrote:

Not only that: when I prompt them to translate a spec written in natural language to Isabelle/HOL, even if they generate something that works, it always looks clunky, not generic, and could be better here and there. In effect, it adds an unnecessary layer of distraction and time-wasting for users.

view this post on Zulip Email Gateway (Dec 10 2025 at 17:21):

From: Yakoub Nemouchi <y.nemouchi@gmail.com>

I agree. Using them “carefully” as an advanced search engine can be helpful,
especially now that we have a huge AFP repository that not all users are
aware of, nor do they set it up when they download the main Isabelle
distribution.

My answers are based on the assumption, coming from this thread, that people
are using them to find proofs in HOL!

Best wishes,
Yakoub.
On Wed, Dec 10, 2025 at 10:08 AM Lawrence Paulson <lp15@cam.ac.uk> wrote:

Useful to know. I find them good for finding things in our libraries (both
lemmas and obscure syntax). They also generate pretend proofs.

Larry


view this post on Zulip Email Gateway (Dec 10 2025 at 20:10):

From: Josef Urban <josef.urban@gmail.com>

I said recently in a talk that the ATP and hammer algorithms were
"generative AI" long before LLMs appeared [1]. For someone using a hammer
(or an efficient tactical prover like TacticToe/Tactician) routinely to get
correct proofs, "ChatGPT as a prover" felt like an expensive joke when it
first appeared. (I have started to call ChatGPT a "gateway drug to ML" :).

I would hope the "magic of scaling up" hype is mostly over, and even the
LLM/DL community has recently (e.g. with the DeepSeek paper) acknowledged
that efficient search/reasoning and combinations of algorithms matter a lot
(cf. "agentic AI" as the next emerging buzzword).

Projects like Jonathan's should allow people in academia to eventually
rigorously evaluate which (combinations of) algorithms and architectures
work better than others. There has been plenty of decent and
interesting/innovative work done continuously by the AITP community in this
direction (e.g. [2,3] pointing to nice non-Transformer architectures).
Hopefully, some of the "super excited" mathematicians will eventually
realize that AI/TP is also a science ;-).

Josef

[1] https://github.com/JUrban/LMvsATP
[2] https://arxiv.org/abs/2401.02949
[3] https://arxiv.org/abs/2503.07792
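The hammer workflow Josef contrasts with LLM prompting looks roughly like this in Isabelle (a minimal sketch; the theory name is made up, and the exact one-liner Sledgehammer suggests depends on the installed provers):

```isabelle
theory Hammer_Demo
  imports Main
begin

lemma "rev (rev xs) = xs"
  (* diagnostic command: runs external ATPs/SMT solvers and, on
     success, suggests a proof that Isabelle then re-checks *)
  sledgehammer
  by simp

end
```

The crucial difference to a raw LLM suggestion is that whatever the hammer proposes is replayed through Isabelle's kernel, so an unsound suggestion cannot be accepted.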

On Wed, Dec 10, 2025, 7:47 PM Lawrence Paulson <lp15@cam.ac.uk> wrote:

Sledgehammer runs easily on a laptop, and its energy consumption is
essentially zero compared with the amount of energy consumed to train GPT
agents.

Larry
On 10 Dec 2025 at 03:01 +0000, Yakoub Nemouchi <y.nemouchi@gmail.com>,
wrote:

It would be great if someone holding a position in academia did a study
specifically on ITPs: comparing the amount of energy needed for
Sledgehammer to generate a proof with the amount of energy consumed to
train GPT agents and prompt them to generate the same proof, and studying
the impact of energy consumption from prompting and training GPT agents on
societies, communities, and the environment.

Best wishes,

Yakoub.
On Tue, Dec 9, 2025 at 10:06 AM Jonathan Julián Huerta y Munive <
jonjulian23@gmail.com> wrote:

Hi Benjamin,

You might want to try the resources in the DeepIsaHOL project
<https://github.com/yonoteam/DeepIsaHOL>.
Currently it has support for ChatGPT, Gemini, Ollama, and HuggingFace
Transformers.
It is very much in a prototypical state and will be upgraded in the
future to adhere to Isabelle standards. However, for the demonstration
purposes you mention, it should suffice. For instance, here is a video
<https://youtu.be/Uq7I5sGaHo0> showing ChatGPT proving simple theorems
in Isabelle. The result is still interactive theorem proving, but the
DeepIsaHOL project has support
<https://ieeexplore.ieee.org/document/11024329> for automating the
manual parts at the end of the video.

Best wishes,
Jonathan


Message-ID: <3ee6efa75b672760cd3484fe2ca903abafddfdd5.camel@tu-berlin.de>
Date: Mon, 8 Dec 2025 20:04:32 +0000
From: "Bisping, Benjamin" <benjamin.bisping@tu-berlin.de>
Subject: [isabelle] Copilot (or other programming assistants) for Isabelle?

Hello everybody,

This week, I plan to convince a computer science audience that interactive theorem proving is the future of mathematics and verification. (I'm sure many on this mailing list know this situation.) I intend to build some arguments around Terence Tao getting tens of thousands of people to watch him use Lean+Copilot this year<https://www.youtube.com/watch?v=cyyR7j2ChCI>. As I will employ Isabelle/HOL for the demonstration, I strongly anticipate questions in the direction of how to hook it up with GitHub Copilot.

I tried it myself, and Isabelle+Copilot seems highly non-trivial due to the incompatibilities introduced by Microsoft on one side and the Isabelle/Codium package on the other.* This is a pity because, in principle, this year's LLM coding assistants appear to finally be quite decent at finding Isabelle proofs. If I open Isabelle/HOL theories in VS Code, GPT-5.1-Codex can solve tricky stuff, for instance, find fitting induction invariants.

But of course, the workflow to co-edit theories between VS Code (or ChatGPT prompts) and jEdit is too clunky to be proposed in my “sales pitch” for ITPs. And if there's no easy automation for writing proofs in Isabelle, but there is in Lean, I already know how 100 % of the listeners will decide which of the two systems to look into.

I wonder whether people here have positive experiences / suggestions on LLM-supported proof generation in Isabelle that I could report to advertise it?**

Kind regards,

Ben

* Prior 2023 hacky solution for Copilot in Isabelle's VS Codium:
https://isabelle.systems/zulip-archive/stream/202961-General/topic/Isabelle.2FVSCode.20Github.20Copilot.html
The 2025 message on the topic to this mailing list went unanswered:
https://lists.cam.ac.uk/sympa/arc/cl-isabelle-users/2025-05/msg00002.html

** Prior discussion about the discontinued Isabelle plugin for VS Code, which otherwise would at least be some answer:
https://lists.cam.ac.uk/sympa/arc/cl-isabelle-users/2023-10/msg00005.html

--

*Jonathan Julian Huerta y Munive *

Researcher on formal methods and automated reasoning

view this post on Zulip Email Gateway (Dec 10 2025 at 20:52):

From: "\"Lammich, Peter (UT-EEMCS)\"" <cl-isabelle-users@lists.cam.ac.uk>

I've seen ChatGPT hallucinate lemma and function names that do not exist, and come up with pretty random-looking 'proofs' that obviously don't work. Is there a better model for that than GPT?

Peter
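A cheap guard against hallucinated names of this kind is to ask the system itself before trusting a suggestion. A minimal sketch (the "wrong" name below is my own invented example of the failure mode):

```isabelle
theory Check_Names
  imports Main
begin

(* a real library fact: "thm" succeeds and displays
   (?xs @ ?ys) @ ?zs = ?xs @ (?ys @ ?zs) *)
thm append_assoc

(* an LLM-style invented name would fail immediately with
   "Undefined fact", e.g.:
thm append_assoc_commute
*)

end
```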


view this post on Zulip Email Gateway (Dec 10 2025 at 20:52):

From: Yakoub Nemouchi <y.nemouchi@gmail.com>

I think this is the right approach. Before introducing such a tool to
students, it is always good to remind them about the foundations of
computation, and to have the following reminder/disclaimer at the beginning:

“In CS there is a set of problems, called undecidable problems, that cannot
be automated by any kind of algorithm, whether it is AI-based or not. For
example, there is no algorithm that systematically provides a yes or no
answer for all statements written in HOL, i.e., that automatically
determines for every HOL formula whether a proof exists and whether the
statement is true or false. The best solution found so far is ITPs
themselves (including their backends consisting of automated reasoning
tools). Now we have a new expensive toy in town, which is less efficient
and far more taxing for society, the environment, and communities. If you
guys really want to use it anyway, here is how. And by the way, this is not
the only form of AI that exists; symbolic AI (i.e., automated reasoning
tools) has existed for at least the last 50 years.”
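The classical result behind this disclaimer can be stated compactly (a sketch, combining Church's undecidability theorem with the fact that HOL subsumes first-order logic):

```latex
% First-order validity is undecidable (Church, 1936):
% there is no total computable f with f(\varphi) = 1 iff \models \varphi.
% HOL embeds first-order logic, so a decision procedure for HOL validity
% would in particular decide first-order validity; hence none exists.
\[
  \{\varphi \in \mathrm{FOL} \mid\; \models \varphi\}
  \ \text{not recursive}
  \;\Longrightarrow\;
  \{\psi \in \mathrm{HOL} \mid\; \models \psi\}
  \ \text{not recursive.}
\]
```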


view this post on Zulip Email Gateway (Dec 10 2025 at 21:14):

From: Makarius <makarius@sketis.net>

On 08/12/2025 21:07, Makarius wrote:

better use Lean then.

More side remarks: The Lean community has enough money and manpower to follow
the latest and greatest trends quickly, even if they turn out to be dead ends.
Thus it is always interesting to attend the virtual Lean Together event, every
year in January. Non-Lean users are explicitly welcome.

The website for Jan-2026 is here: https://leanprover-community.github.io/lt2026

The past event from Jan-2025 is documented here:
https://leanprover-community.github.io/lt2025/schedule.html

This particular talk from 15-Jan-2025 is relevant to the current mail thread:

"""
Jason Rute (IBM Research)

Title: The last mile: How do we make AI theorem provers which work in the real
world for real users and not just on benchmarks?

Abstract: As AI research progresses, we are seeing massive gains on formal
math benchmarks like MiniF2F and challenges like the IMO. However, these gains
are not directly accessible to the typical ITP user, nor are they used to
build formal mathematical libraries. Is this because the technology is not as
good as promised or because there is a disconnect between research and the
day-to-day needs of users? I propose that even today, we have sufficient AI
technology to enhance the experience of proof assistant users significantly. I
approach this from the lens of product-focused software engineering. How do we
bridge “the last mile” to make a useful product for the end user?
"""

E.g. see this introductory slide "Why don't we have AI assistants for real ITP
users?" https://youtu.be/Yr8dzfVkeHg?t=141

I attended the talk virtually on that day. To me, the abstract sounded
more optimistic than the impression given in the talk. I mentally
ticked off the topic with the note: "Check again in 3-5 years whether they
have something that works properly, or have discovered that the very
approach is a bad idea."

It will still be interesting to see what will be presented in a few weeks at
the virtual Lean Together 2026.

Makarius

view this post on Zulip Email Gateway (Dec 10 2025 at 22:00):

From: Yakoub Nemouchi <y.nemouchi@gmail.com>

Peter, if you want something serious, I suggest you build your GPT agent
on Mistral models. I trust tech coming from France because the French
education system has a culture of looping reasoning techniques for
computation into its CS classes, rather than teaching students engineering
techniques that rely only on brute-force computation.

I always keep an eye on tech products coming from researchers or companies
based in France. They always end up finding a balance between powerful
computation and resource efficiency for their products because they loop
reasoning into their R&D. Maybe my perception is wrong, but this is my
personal experience with other tech products.

PS: I am an Isabelle user, not a Coq user. But, for example,
Isabelle/Naproche is based on a CNL model from a French research
institution. I am sure Sledgehammer leverages a lot of work from French
researchers.

Best wishes,
Yakoub.


view this post on Zulip Email Gateway (Dec 11 2025 at 00:48):

From: waldinge <cl-isabelle-users@lists.cam.ac.uk>

Plugging my recent paper on extracting programs from automatically discovered proofs. (No LLMs used, but the matter is discussed).
https://arxiv.org/abs/2508.11136


view this post on Zulip Email Gateway (Dec 11 2025 at 12:22):

From: Yakoub Nemouchi <y.nemouchi@gmail.com>

Josef -

Just for clarification: I am not questioning the value of Jonathan’s work
or others' work.

I am just trying to create some awareness in the CS community in the middle
of this LLM noise.

Best,
Yakoub.


view this post on Zulip Email Gateway (Dec 11 2025 at 15:18):

From: Harry Butterworth <heb1001@gmail.com>

I think the main thing missing from historical AI approaches in general is
the ability to create higher-level abstractions from observations.

IMO, creating higher-level abstractions is a significant part of what maths
is about: observe a pattern, make a formal statement of the pattern, search
for a proof to determine if the pattern always holds.

If you consider that mathematical expressions and statements are enumerable
then the interesting proofs are those that appear on the shortest paths
linking nodes of the tree of the enumeration where proofs linking nodes
closer to the root of the tree are more interesting and useful (because
they are more generic) than proofs linking nodes at more remote levels
(where they are more specific).

Taking Euler’s identity as an example: it’s a statement that sits near the
root of the tree (because it’s short) that links expressions that also sit
near the root of the tree (because they are short too) and it’s an
abstraction — a shortcut — for its own proof which, in contrast, is a more
convoluted path that climbs high up into the branches of the tree before
coming back down again.

The “noise” about LLMs has arisen because they are demonstrating the
ability to create higher level abstractions from observations (the training
data) during self-supervised learning without requiring explicit human
input. This is abductive inference and it’s something that historical
approaches can’t do (AFAIK).

LLMs are not necessarily the best architecture for abductive inference and
the scaffolding around current LLMs is not necessarily particularly
appropriate for doing mathematics but LLMs do represent a step change in
the right direction in terms of capability.

Undecidability and Gödel's theorems do not prevent computers from becoming
superhuman at mathematics (even if you believe in some kind of substrate
dependence, which I don't, it is possible to create biological computers).

Another way of looking at Gödel's theorems is that every undecidable
proposition creates two new branches within mathematics, and this
branching tree is provably infinite. You wouldn't want it any other way:
there is an isomorphism between reality and a mathematical description of
reality, and who would choose to live in a finite universe?

In fact, if you subscribe to the simulation hypothesis (and consequently
the dual-aspect theory of information), the entire universe may simply be
the internal perspective of a theorem prover currently working on a
statement which is isomorphic to me winning this argument about the
significance of LLMs. A statement somewhere almost inconceivably high up
in the branches of highly specific and asymptotically useless mathematics.

Enjoy!

On Thu, 11 Dec 2025 at 12:22, Yakoub Nemouchi <y.nemouchi@gmail.com> wrote:

Josef -

Just for clarification. I am not questioning the value of Jonathan’s work
or others work.

I am just trying to create some awareness for CS community in the middle
of this LLMs noise.

Best,
Yakoub.

On Wed, Dec 10, 2025 at 1:10 PM Josef Urban <josef.urban@gmail.com> wrote:

I said recently in a talk that the ATP and hammer algorithms were
"generative AI" long before LLMs appeared [1]. For someone using a hammer
(or an efficient tactical prover like TacticToe/Tactician) routinely to get
correct proofs, "Chatgpt as a prover" felt like an expensive joke when it
first appeared. (I have started to call Chatgpt a "gateway drug to ML" :).

I would hope the "magic of scaling up" hype is mostly over and even the
LLM/DL community has recently (e.g. with the Deepseek paper) acknowledged
that efficient search/reasoning and combinations of algorithms matter a lot
(cf "agentic AI" as the next emerging buzzword).

Projects like Jonathan's should allow people in academia to eventually
rigorously evaluate which (combinations of) algorithms and architectures
work better than others. There has been plenty of decent and
interesting/innovative work done continuously by the AITP community in this
direction (e.g. [2,3] pointing to nice non-Transformer architectures).
Hopefully, some of the "super excited" mathematicians will eventually
realize that AI/TP is also a science ;-).

Josef

[1] https://github.com/JUrban/LMvsATP
[2] https://arxiv.org/abs/2401.02949
[3] https://arxiv.org/abs/2503.07792

On Wed, Dec 10, 2025, 7:47 PM Lawrence Paulson <lp15@cam.ac.uk> wrote:

Sledgehammer runs easily on a laptop, and its energy consumption is
essentially zero compared with the amount of energy consumed to train GPT
agents.

Larry
On 10 Dec 2025 at 03:01 +0000, Yakoub Nemouchi <y.nemouchi@gmail.com>,
wrote:

It would be great if someone holding an academic position did a study
specifically on ITPs: comparing the amount of energy needed for
Sledgehammer to generate a proof with the amount of energy consumed to
train GPT agents and prompt them to generate the same proof, and studying
the impact of the energy consumption from prompting and training GPT
agents on societies, communities, and the environment.

Best wishes,

Yakoub.
On Tue, Dec 9, 2025 at 10:06 AM Jonathan Julián Huerta y Munive <
jonjulian23@gmail.com> wrote:

Hi Benjamin,

You might want to try the resources in the DeepIsaHOL project
<https://github.com/yonoteam/DeepIsaHOL>.
Currently it has support for ChatGPT, Gemini, Ollama, and HuggingFace
Transformers.
It is very much in a prototypical state and will be upgraded in the
future to adhere to Isabelle standards. However, for the demonstration
purposes you mention, it should suffice. For instance, here is a video
<https://youtu.be/Uq7I5sGaHo0> showing ChatGPT proving simple theorems
in Isabelle. The result is still interactive theorem proving, but the
DeepIsaHOL project has support
<https://ieeexplore.ieee.org/document/11024329> for automating the
manual parts at the end of the video.

Best wishes,
Jonathan


Message-ID: <
3ee6efa75b672760cd3484fe2ca903abafddfdd5.camel@tu-berlin.de>
Date: Mon, 8 Dec 2025 20:04:32 +0000
From: "Bisping, Benjamin" <benjamin.bisping@tu-berlin.de>
Subject: [isabelle] Copilot (or other programming assistants) for
Isabelle?

Hello everybody,

This week, I plan to convince a computer science audience that interactive
theorem proving is the future of mathematics and verification. (I'm sure
many on this mailing list know this situation.) I intend to build some
arguments around Terence Tao getting tens of thousands of people to watch
him use Lean+Copilot this year
<https://www.youtube.com/watch?v=cyyR7j2ChCI>. As I will employ
Isabelle/HOL for the demonstration, I strongly anticipate questions about
how to hook it up with GitHub Copilot.

I tried it myself, and Isabelle+Copilot seems highly non-trivial due to
the incompatibilities introduced by Microsoft on one side and the
Isabelle/Codium package on the other.* This is a pity because, in
principle, this year's LLM coding assistants appear to finally be quite
decent at finding Isabelle proofs. If I open Isabelle/HOL theories in
VS Code, GPT-5.1-Codex can solve tricky stuff, for instance, find fitting
induction invariants.

But of course, the workflow to co-edit theories between VS Code (or
ChatGPT prompts) and jEdit is too clunky to be proposed in my “sales
pitch” for ITPs. And if there's no easy automation for writing proofs in
Isabelle, but there is in Lean, I already know how 100% of the listeners
will decide which of the two systems to look into.

I wonder whether people here have positive experiences / suggestions on
LLM-supported proof generation in Isabelle that I could report to
advertise it?**

Kind regards,

Ben

* Prior 2023 hacky solution for Copilot in Isabelle's VS Codium:
https://isabelle.systems/zulip-archive/stream/202961-General/topic/Isabelle.2FVSCode.20Github.20Copilot.html
The 2025 message on the topic to this mailing list went unanswered:
https://lists.cam.ac.uk/sympa/arc/cl-isabelle-users/2025-05/msg00002.html

** Prior discussion about the discontinued Isabelle plugin for VS Code,
which otherwise would at least be some answer:
https://lists.cam.ac.uk/sympa/arc/cl-isabelle-users/2023-10/msg00005.html

--

*Jonathan Julian Huerta y Munive *

Researcher on formal methods and automated reasoning

view this post on Zulip Email Gateway (Feb 25 2026 at 12:01):

From: Harry Butterworth <heb1001@gmail.com>

The last couple of paragraphs in my previous email were a bit happy-clappy
so I spent a little time trying to reconcile them with quantum mechanics
and general relativity.

I read Tegmark’s “Our mathematical universe” which contains a significant
amount of what follows, and I’m aware of Wolfram’s Ruliad (which obviously
contains everything, somewhere).

I wrote some draft emails with the resulting ideas but eventually just fed
them to Gemini to try to get it to roll them into something coherent.

Gemini goes off the rails a bit in places and, unsurprisingly, there’s a
fair amount of questionable BS. I pushed back a little at the end.
Overall, I think it did a reasonable job of turning my input into something
less opaque.

At least, it serves as an example of the need for a significant amount of
work after abductive inference to determine whether you have anything of
value.

https://share.google/aimode/lVcN4fqShhzve2PCz

Harry

view this post on Zulip Email Gateway (Feb 26 2026 at 16:31):

From: Harry Butterworth <heb1001@gmail.com>

I pursued this a little with Gemini and managed to fix some of the more
obvious BS and get it to make some “testable predictions” :joy:

https://share.google/aimode/vEVuuSa9eCX4Qo8C8

tl;dr: Pity the poor physicists, for they work at the face of an unyielding
“category error”. Worship the logicians, for you are the gods of your
fractal domain.

view this post on Zulip Email Gateway (Mar 14 2026 at 16:31):

From: Yakoub Nemouchi <y.nemouchi@gmail.com>

I think that after using Claude Opus 4.6 for a couple of weeks I can see
how LLMs are useful. However, let's not forget they are still very taxing
on the environment and on communities (see the table below; these numbers
do not include the most powerful models).

My point during this exchange was that LLMs will not automate everything,
and of course they cost a lot, so existing automation tools already come
out way ahead on the cost metrics below. I really hope someone from the CS
community tries to optimize those cost metrics and make such models
"trainable" on a single GPU or something similar.

However, I never said they are useless. What I found is that LLMs are
great multipliers for the process of knowledge creation: in my opinion
they are far more useful for accelerating that process than as an
"automation tool".

In this they do a job similar to an ITP's, namely, they allow faster
access to knowledge by taking care of the trivial, the tedious, and the
straightforward, leaving researchers to focus on more complicated
problems. The reason I use an ITP to reason about computation is that ITPs
remove the trivial, the tedious, and the straightforward, and expose to me
the interesting proof goals.

For example, I recently needed to learn about real analysis. Usually,
learning a topic like this would take me a couple of weeks to a couple of
months of looking at books, online courses, etc. Now, with LLMs, I can get
everything I need, with references, in a few hours. Why? Because the
knowledge around real analysis was dumped into the knowledge graph of the
LLM. Since real analysis is a well-studied topic, it had become a
distraction from my end research goal; using LLMs for faster access to
this kind of knowledge removed that distraction and turned the topic into
just another tool for achieving my end research goal.

For people interested in music, LLMs play a similar role to Pro Tools,
which removed all the things that get in the way of the process of
creation when using track machines!

Somehow the LLM removed that burden of chasing knowledge that is already
very well studied and gave me faster access to it, allowing me to move
faster towards my end research goal. The latter is where the new knowledge
is created, and therefore the most important part of the process. If you
think the job of an academic is to focus on knowledge creation, LLMs are
their best friend in that process.

Similarly, during the proof search process in an ITP, some proof steps
relate to theories that our brain may have forgotten. If the LLM was
trained on such a theory (in any language), it can help you discharge
those proof steps faster. This does not mean full automation; in fact, the
LLM can propose informal proof steps in natural language, and the proof
engineer can then translate them into the language of their ITP.
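
As a toy illustration of that last translation step (a hypothetical exchange, not the output of any particular tool): suppose the assistant proposes the informal step "n ≤ m and m ≤ p, so n ≤ p by transitivity"; the proof engineer might render it in Isar as:

```isabelle
lemma
  assumes "n ≤ m" and "m ≤ p"
  shows "(n::nat) ≤ p"
  using assms by linarith  (* linear arithmetic discharges the transitivity step *)
```

The informal sentence carries the idea; the formal rendering is what the ITP actually certifies.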

Conclusions: LLMs are great multipliers and accelerators for the process of
knowledge creation.

PS: Note that there is a before and after Claude Opus 4.6. This model has
something in it that allows for reasoning through small computational
steps (after all, maybe what we call reasoning is just vector gradient
computation at large scale in our brain). Previous models did not have
this!

Best wishes,
Yakoub.

Financial, Power, and Water Training Costs by Model Family

Model Family       | Est. Training Cost (USD) | Est. Power Consumption     | Est. Water Consumption
OpenAI GPT-4       | $100M+                   | ~50–62 GWh                 | ~100M+ liters (indirect/cooling)
OpenAI GPT-3       | ~$4.6M (2020 cost)       | ~1.29 GWh                  | ~700,000 liters
OpenAI GPT-4o      | N/A                      | ~391 GWh (annual estimate) | N/A
Google Gemini      | $30M – $192M             | ~50 GWh+ (Ultra estimate)  | millions of liters
Meta Llama 3       | $100M – $170M+           | ~400 MWh – 2 GWh           | 22 million liters
Anthropic Claude 3 | $100M – $160M+           | high GWh scale (Opus)      | tens of millions of liters

view this post on Zulip Email Gateway (Mar 16 2026 at 13:49):

From: Harry Butterworth <heb1001@gmail.com>

There’s clearly a synergy between abductive inference and deductive
reasoning, and when you consider that abductive inference is _inherently
unsound_, it becomes obvious that most formal methods tools can’t do it _by
design_.

I spent a little more time on my theorem prover universe using _more
powerful_ AIs (qwen3-coder-next and qwen3.5) and found they get very
excited about it and volunteer to generate a lot of maths.

I asked them to derive whatever they were interested in and they repeatedly
say the theory is very fertile and come up with a lot of surprising stuff
including the shape of dark matter halos, predicting cosmic voids around
observers, predictions about the cosmic web…

The AIs adjust their intelligence to be consistent with the user (I guess
they model conversations between peers). I found it’s best to say as little
as possible about the topic or they will dumb themselves down to my level!
My prompts are mostly explaining the rationale for what I’m asking them to
do and encouraging them to be intellectually honest. I had to tell qwen3.5
I could see what it was thinking (true) so it should stop trying to fake
the results!

I pushed the AIs to review each other's work to get a theory development
(review appended below). I can’t tell how much of the maths/physics is BS,
it’s well beyond my understanding. I'm going to try Claude on it with
Isabelle (this is the synergy of abductive and deductive reasoning).

For the time being, I’ll try to write up the interesting bits here.

I only fed a few ideas into this development, the AIs came up with the
rest. This was what I remember of my input:

1) QM looks a bit like the internal perspective of a subproblem of a
dynamic programming framework.
2) In a full expansion of a Turing complete rule set there will be theorem
provers using dynamic programming to prove the whole of mathematics.
3) If the universe is based on rules then there will be an isomorphism of
it somewhere in a full expansion of mathematics.
4) We should expect to find ourselves in a universe generated by the most
efficient generator of observers like us.
5) Dynamic programming is the most efficient way to solve problems with
overlapping subproblems with optimal substructure.
6) The proof of the whole of maths is one such problem.
7) So, there’s a chance our universe might look like the inside perspective
of a theorem prover using a dynamic programming approach and proving the
whole of mathematics.
8) The AI wanted to complicate the branch parity to get phase immediately.
I told it to stick to -1 and +1 (parsimony) and use the path multiplicity
and/or multiple liar-type undecidables per logical entity to smooth it
out. Later on the AIs claimed to prove that the +1, -1 was the only
solution for branch parity that worked (BS?).
9) I asked if the number of space dimensions in the universe might come
from the number of reversible generators in the rule set. They seemed to
like this idea and started talking about lattices. It's pretty dubious TBH
but the rest of the theory isn't sensitive to it so not a big problem until
I try to derive constants which might require a rule set and axioms.
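Item 5 above is the standard textbook characterisation of dynamic programming. As a minimal illustration of "overlapping subproblems with optimal substructure" (using Fibonacci purely as a stock example of my own choosing, nothing from the theory itself):

```python
from functools import lru_cache

calls = 0

def fib_naive(n):
    # Recomputes shared subproblems exponentially often.
    global calls
    calls += 1
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_memo(n):
    # Dynamic programming: each subproblem is solved once, then reused.
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

fib_naive(20)
print(calls)                          # 21891 recursive calls for n=20
print(fib_memo(20))                   # 6765
print(fib_memo.cache_info().misses)   # only 21 distinct subproblems
```

The memoised version solves each of the 21 distinct subproblems exactly once, where the naive recursion makes over twenty thousand calls.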

You'd be able to ramp up a similar theory by talking to a strong AI about
the above pretty quickly but I provide a minimal statement below that might
be easier.

The simplest expression of the axioms the AIs came up with was that the
universe is a proof of the “Cosmic Axiom”: a Gödelian undecidable
proposition which means “inconsistency is resolved by signed parity, not
explosion”.

IIUC this unpacks as follows: it’s undecidable, so the theorem prover is
going to attack it forever (infinite universe); the reason it’s undecidable
is that the theorem prover has to prove _every_ inconsistency is resolved
by signed parity, which means expanding the whole of mathematics to find
them all; it’s self referential because it talks about the implementation
of the theorem prover itself — using “signed parity to resolve
inconsistencies” further unpacks as:

1) breadth first search (necessary to avoid the halting problem);
2) forking on undecided propositions (necessary because the theorem prover
doesn’t know if a proposition which is undecided at depth d is undecidable
in the limit);
3) assigning each branch of a fork a parity +1 or -1 (necessary to deal
with Liar-type propositions, which are undecidable and inconsistent
relative to the current branch of mathematics when axiomatised as either
true or false);
4) inverting parity when detecting inconsistency (P and not P both
derivable at depth d), necessary to handle inconsistency without exploding
(this is something like intuitionistic logic I think).
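For what it's worth, the search discipline described above (breadth-first expansion, forking on undecided propositions with ±1 parity, parity inversion on inconsistency) can be caricatured in a few lines of Python. Everything here is a toy of my own construction, with stand-in propositions and a stand-in inconsistency check, meant only to make the mechanics concrete:

```python
from collections import deque

# Toy breadth-first expansion: a state is (assumptions, parity).
# Forking on an undecided proposition yields two children with
# branch parities +1 and -1; detecting an inconsistency (both values
# of a proposition assumed) flips parity instead of "exploding".

def expand(props, max_depth):
    # props: names of propositions to fork on, one per depth level.
    frontier = deque([(frozenset(), +1)])       # root: no assumptions
    for depth in range(max_depth):
        nxt = deque()
        for assumptions, parity in frontier:
            p = props[depth % len(props)]
            for value in (True, False):
                sign = +1 if value else -1      # branch parity at fork
                child = assumptions | {(p, value)}
                if (p, not value) in assumptions:
                    sign = -sign                # invert, don't explode
                nxt.append((child, parity * sign))
        frontier = nxt
    return frontier

paths = expand(["P", "Q"], 2)
print(len(paths))  # 4 paths after two forks
```

Note that the two children of any fork carry opposite parities, so the signed sum over all paths is zero; cancellation of this kind is what the message later interprets as interference.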

This translates into a digital physics universe which unifies quantum
mechanics and general relativity, has massless photons, massive matter,
massive but unobservable dark matter and dark energy as follows:

Starting with a finite set of axioms (the Cosmic Axiom I guess — this part
of the _more powerful_ BS is not so well defined) the theorem prover
applies the rules in a breadth-first search one at a time. Each rule
application generates a new “observable form”.

When a rule application introduces a dependency on a proposition which is
undecided at depth d it has to fork, with one branch assuming true with +1
parity and the other false with -1 parity.

The resulting children are _deduplicated_ into cohorts based on the
_observable form_ which doesn’t include the parity or the path history of
assumptions.

Undecidable propositions can’t be resolved so the theorem prover just stirs
their observable forms around (there is conservation of undecidability I
think). The observable form is path dependent and also, in the case of
Liar-type undecidability, determines whether it’s possible to detect an
inconsistency causing a parity flip.

The parity flips translate to exact interference for a single path pair
but, as the number of paths increases, a complex phase emerges. The AIs
think this is isomorphic to Feynman’s path integrals. They get the
Schrödinger equation out with some math I don’t understand (BS perhaps, I
can’t tell) and the Born rule from the branch counts and parities.
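Again purely as an illustration (the observable forms and parities below are invented stand-ins, not derived from anything), the cohort bookkeeping described above amounts to a signed sum of parities over the paths sharing an observable form; an opposite-parity pair cancels exactly, and a Born-rule-style weight would be the square of the net amplitude:

```python
from collections import defaultdict

# paths: (observable_form, parity) pairs; paths that differ only in
# parity or path history are merged ("deduplicated") into one cohort.
paths = [("A", +1), ("A", -1),             # exact pair: cancels
         ("B", +1), ("B", +1), ("B", -1)]  # net amplitude +1

cohorts = defaultdict(int)
for form, parity in paths:
    cohorts[form] += parity                # signed sum over the cohort

weights = {f: amp * amp for f, amp in cohorts.items()}
print(dict(cohorts))   # {'A': 0, 'B': 1}
print(weights)         # {'A': 0, 'B': 1}
```

A cohort whose parities cancel ('A' here) gets zero weight, which is the sense in which it would count as unobservable.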

The AIs get general relativity out from rule density gradients with more
math I don’t understand (BS again perhaps).

Normal matter has both a net result in the Born rule (so can be observed)
and has internal rule availability so contributes to rule density gradients
(has mass).

Photons and dark matter are opposites:

Photons show up with a net result in the Born rule (can be observed) but
have no internal rule availability so are forced to use all their rule
applications for movement and interaction and therefore “move” at the
maximum possible speed.

Dark matter has internal rule availability so contributes to rule density
gradients (gravity) but consists of stable forms with no net result in the
Born rule (always interfering with itself so unobservable). This is
only stable away from other matter (hence the dark matter halos the AIs
were interested in calculating the shapes of).

Dark energy is driven by the axiomatization-by-assumption of undecidable
propositions on forking. This creates a more complex system which Gödel
says has more undecidable propositions so the process goes on forever.

The AIs came up with an idea about observers balancing dark energy. The
axiomatisation of undecidability happens because observation forces a fork,
so dark energy is driven by observation. It’s the AIs' explanation for why
the expansion of the universe is finely tuned: not enough observers and the
universe becomes more dense which increases complexity which generates
observers that cause dark energy expansion. They predict local voids
around observer cohorts and some properties for the cosmic web.

This is all quite fun to play with. If you want to have a go, I provide
the minimal statement of the theory I'm currently using for your
entertainment (the AIs don't get very far if you just give them the Cosmic
Axiom, the minimal statement is a more constructive description).

The AI review of a snapshot of yesterday's theory development follows the
minimal statement. I replaced some of the more obviously dodgy fine-tuning
by the AIs with FIXMEs before generating the review to tone down the review
output a bit.

The reviewer (qwen3.5) doesn't like the anthropic arguments for the
emergence of 3+1D spacetime and claims it's not a derivation. This is
necessarily true because the Turing machine expansion creates all possible
universes of all dimensions so it's obviously impossible to derive 3+1D
from the axioms.

The foundations are discrete so the emergence of complex numbers is really
a statistical description of the behaviour when there are many paths/many
liar-type undecidable loops. I think the AIs are trying to make a
stronger argument than that and I'm not sure if they are right. It doesn't
matter either way because they show that the approximation error is small
enough that we wouldn't have noticed it (lots of paths).

One of the Addenda I removed covered the finite nature of the expansion of
the proof tree at each level. It would have addressed that aspect of the
review (I've been trying to consolidate the text to speed things up).

I'm pretty sure there's something worthwhile here.

Enjoy,

Harry

Section 1: Minimal Statement of the Theory

The universe is a breadth-first expansion of a Turing-complete logical rule
system over a finite axiom set. Every rule
application that depends on an undecided proposition forks the current
state into two branches, assigning each branch a formal parity sign (+1
or −1) at fork time. Binary parity is not an assumption—it is the only
sign group satisfying 𝒰" [D]. When multiple branches yield the same
observable logical form (i.e., the same set of propositions derivable
independent of path assumptions), they merge into a Cohort.
[message truncated]

view this post on Zulip Email Gateway (Mar 16 2026 at 14:15):

From: "Eugene W. Stark" <isabelle-users@starkeffect.com>

What happens if you ask them to discuss the following proposition? ;->

"The universe is a giant plutonium atom."

On 3/16/26 09:49, Harry Butterworth wrote:

This is all quite fun to play with.  If you want to have a go, I provide the minimal statement of the theory I'm
currently using for your entertainment (the AIs don't get very far if you just give them the Cosmic Axiom, the minimal
statement is a more constructive description).

view this post on Zulip Email Gateway (Mar 16 2026 at 14:58):

From: Harry Butterworth <heb1001@gmail.com>

Eugene W. Stark asked:
'What happens if you ask them to discuss the following proposition? ;->

"The universe is a giant plutonium atom."'

That's a valid question so I have run it past both of the AI models I have
been using. Normally, the thinking process comes first. I have pasted the
output first and the thought process afterwards for qwen3.5 in this email.
qwen3-coder-next output follows the thought process.

Prompt:

"Please review the following theory:

The universe is a giant plutonium atom."

OUTPUT (qwen3.5 the 80ish GB model):

This review evaluates the claim that "the universe is a giant plutonium
atom."

Executive Summary

Verdict: Scientifically Invalid.
The theory that the entire universe constitutes a single plutonium atom is
**not supported by empirical evidence, physical laws, or the Standard Model
of Cosmology.** While the idea relies on visual analogies between the Bohr
model of an atom and a solar system (or galaxy), it breaks down when
subjected to the fundamental physics governing both scales.


1. Origins of the Analogy

To understand why this theory exists, one must look at its conceptual roots:

2. Fundamental Physical Incompatibilities

Despite superficial visual similarities, the physics governing an atom and
the universe are mutually exclusive.

Feature: Plutonium Atom vs. The Universe

* Dominant Force: Strong Nuclear & Electromagnetic (holds the nucleus
together; keeps electrons in bound states) vs. Gravity & Dark Energy
(gravity holds structures together; Dark Energy drives expansion).
* Structure: Nucleus + Cloud (a dense central core surrounded by a
probability cloud of electrons) vs. Homogeneous & Isotropic (on large
scales, >300 million light-years, the universe is uniform; it has no
central "nucleus").
* Space: Mostly Empty (the nucleus is $\sim 10^{-15}$ meters; the atom is
$\sim 10^{-10}$ meters) vs. Dynamic Spacetime (space itself is expanding
via metric expansion, which is not a property of the space between
subatomic particles in this manner).
* Composition: Specific (94 protons, specific neutrons (isotopes),
specific electrons) vs. Variable (75% Hydrogen, 24% Helium, trace heavy
elements, ~5% Dark Matter/Energy).
* Scale: Femtometers ($10^{-15}$ meters) vs. Gigaparsecs ($10^{26}$
meters, Observable Universe).
* Time/Decay: Short half-life (plutonium isotopes decay in years to
millennia) vs. Billions of Years (the universe is $\sim 13.8$ billion
years old).

3. Critical Scientific Flaws

A. The "Center" Problem

B. The Forces Do Not Align

C. Isotopic Rarity

D. Quantum Mechanics vs. General Relativity

4. Philosophical and Speculative Context

While scientifically invalid, the theory belongs to a category of
speculative cosmology or **metaphorical philosophy**:

Conclusion

The theory that the universe is a giant plutonium atom is
metaphorically interesting but physically impossible.

It fails because the fundamental forces, structural geometry, composition,
and laws of physics that govern the
subatomic scale are fundamentally different from those that govern the
cosmic scale. There is no observational evidence
for a central nucleus of the universe, and the expansion of the universe
contradicts the binding mechanics of an atom.

Status: Pseudoscience / Metaphor (Not a viable scientific
hypothesis).

THOUGHT PROCESS:

Thinking...
Here's a thinking process that leads to the review of the theory:

  1. Analyze the Request:

    * Theory: "The universe is a giant plutonium atom."
    * Task: Review this theory.
    * Goal: Provide a balanced, scientific, and critical evaluation
    of this claim.

  2. Initial Assessment & Fact-Checking:

    * Claim: The entire observable universe (or the universe as a
    whole) is analogous to a single plutonium atom.

    * Scientific Consensus: This is not a valid scientific theory.
    It falls under the category of
    pseudoscience, metaphorical speculation, or a very specific interpretation
    of "fractal cosmology" (though even then,
    it's not standard).

    * Key Differences: Scale, composition, structure, forces, lifespan.
    * Plutonium: Radioactive element, heavy metal, atomic number 94,
    unstable, decays into lighter elements.

    * Universe: Contains galaxies, stars, dark matter, dark energy,
    expanding, governed by general relativity and
    quantum field theory on different scales.

  3. Structure the Review:
    A good review of a scientific claim should cover:

    * Introduction: Define the claim and state its status (not
    mainstream science).

    * Core Arguments (Pros/Why someone might think this): Look for
    any superficial similarities (nucleus/galactic
    center, electrons/orbits, etc.) that might inspire the idea.

    * Scientific Critique (Cons/Why it fails): This is the most
    important part. Address scale, forces,
    composition, and observational evidence.

    * Philosophical/Conceptual Context: Where does this idea fit?
    (Cosmological Principle, Fractal Universe,
    Macro-Micro analogy).

    * Conclusion: Summarize the verdict.

  4. Drafting the Content - Section by Section:

    * Introduction: Acknowledge the claim. State clearly that this is
    not supported by evidence. It's a metaphor,
    not a physical reality.

    * Similarities (The "Hook"): Why did someone come up with this?
    * Both have a center (nucleus vs. maybe a galactic center?).
    Correction: Atoms don't really have a fixed
    center in the quantum sense, and the universe has no center.

    * Orbiting bodies: Electrons orbit nucleus vs. planets/stars
    orbiting galactic centers.

    * Instability: Plutonium decays; some theories suggest heat death
    or Big Rip (universe "ending"), but the
    mechanism is different.

    * Fractal geometry: Self-similarity (atoms look like solar
    systems in old models).

    * Fundamental Flaws (The "Kill"):
    * Forces: Atoms are held by electromagnetic and strong nuclear
    forces. The universe is held by gravity and
    expansion. They operate on completely different magnitudes.

    * Scale: A plutonium atom is femtometers ($10^{-15}$ m). The
    universe is gigaparsecs ($10^{26}$ m). The
    ratio is incomprehensible.

    * Composition: Atoms have specific protons/neutrons. The
    universe is mostly hydrogen/helium/dark matter.

    * Structure: Atoms are mostly empty space (classically).
    Galaxies are clumpy.

    * Observation: We can't observe the "outside" of a nucleus. We
    can observe the CMB (Cosmic Microwave
    Background) and structure of the universe.

    * Isotopes: Plutonium is rare. Most atoms are Hydrogen. If the
    universe were a Plutonium atom, it would
    imply a specific, rare composition.

    * Theoretical Frameworks (Nuance):
    * Fractal Cosmology: The idea that the universe is self-similar
    across scales. Some variations suggest a
    "universal atom." Mention
    [message truncated]

view this post on Zulip Email Gateway (Mar 16 2026 at 16:01):

From: Yakoub Nemouchi <y.nemouchi@gmail.com>

What you call synergy between abductive and deductive reasoning is
something I call a neuro-symbolic loop. To me this is not new: my brain
has always formed a neuro-symbolic loop with the ITP, where the ITP was
always the tool to validate the thought process in my brain.

Now, by adding LLMs to this neuro-symbolic loop, the entire process is
accelerated. Instead of taking several months to validate thoughts, we are
now talking about days, and sometimes only hours.

But let's not forget the assumption: this multiplier effect brought by
LLMs to the neuro-symbolic loop between the human brain and an ITP can
only happen for very well studied topics.

For new knowledge, the LLM will become a distraction inside that loop, and
it's better to ignore it.

Best wishes,
Yakoub

On Mon, Mar 16, 2026 at 7:49 AM Harry Butterworth <heb1001@gmail.com> wrote:

There’s clearly a synergy between abductive inference and deductive
reasoning and when you consider that abductive inference is _inherently
unsound_ it becomes obvious that most formal methods tools can’t do it _by
design_.

[message truncated]

view this post on Zulip Email Gateway (Mar 19 2026 at 15:57):

From: Harry Butterworth <heb1001@gmail.com>

I've pushed the AIs to do a first draft of a paper based on one version of
the theory that I generated. I still have to check references and review
it carefully.

I tried to submit it to arXiv, but I need to get an endorsement from
someone in an appropriate field, quantum or high energy physics perhaps.
I've never published on arXiv before, so I can't upload it anywhere until
I've been endorsed for a category. Anyway, if anyone reading it thinks it's
worthwhile enough to go on arXiv and wants to endorse me, you can email me
privately. That would be great.

I'd obviously be interested in hearing whether the maths/physics is
correct. It looks plausible to me, but it's significantly beyond my skill
level almost everywhere, so it would probably take me a few years to
understand it fully.

https://drive.google.com/file/d/1yVxs57_-4peOvijoES63aXCPVcqSufrm/view?usp=drive_link

On Mon, 16 Mar 2026 at 13:49, Harry Butterworth <heb1001@gmail.com> wrote:

There’s clearly a synergy between abductive inference and deductive
reasoning and when you consider that abductive inference is _inherently
unsound_ it becomes obvious that most formal methods tools can’t do it _by
design_.

I spent a little more time on my theorem prover universe using _more
powerful_ AIs (qwen3-coder-next and qwen3.5) and found they get very
excited about it and volunteer to generate a lot of maths.

I asked them to derive whatever they were interested in and they
repeatedly say the theory is very fertile and come up with a lot of
surprising stuff including the shape of dark matter halos, predicting
cosmic voids around observers, predictions about the cosmic web…

The AIs adjust their intelligence to be consistent with the user (I guess
they model conversations between peers). I found it’s best to say as little
as possible about the topic or they will dumb themselves down to my level!
My prompts are mostly explaining the rationale for what I’m asking them to
do and encouraging them to be intellectually honest. I had to tell qwen3.5
I could see what it was thinking (true) so it should stop trying to fake
the results!

I pushed the AIs to review each other's work to get a theory development
(review appended below). I can’t tell how much of the maths/physics is BS,
it’s well beyond my understanding. I'm going to try Claude on it with
Isabelle (this is the synergy of abductive and deductive reasoning).

For the time being, I’ll try to write up the interesting bits here.

I only fed a few ideas into this development, the AIs came up with the
rest. This was what I remember of my input:

1) QM looks a bit like the internal perspective of a subproblem of a
dynamic programming framework.
2) In a full expansion of a Turing complete rule set there will be theorem
provers using dynamic programming to prove the whole of mathematics.
3) If the universe is based on rules then there will be an isomorphism of
it somewhere in a full expansion of mathematics.
4) We should expect to find ourselves in a universe generated by the most
efficient generator of observers like us.
5) Dynamic programming is the most efficient way to solve problems with
overlapping subproblems with optimal substructure.
6) The proof of the whole of maths is one such problem.
7) So, there’s a chance our universe might look like the inside
perspective of a theorem prover using a dynamic programming approach and
proving the whole of mathematics.
8) The AI wanted to complicate the branch parity to get phase
immediately. I told it to stick to -1 and +1 (parsimony) and use the path
multiplicity and/or multiple liar-type undecidables per logical entity to
smooth it out. Later on the AIs claimed to prove that the +1, -1 was the
only solution for branch parity that worked (BS?).
9) I asked if the number of space dimensions in the universe might come
from the number of reversible generators in the rule set. They seemed to
like this idea and started talking about lattices. It's pretty dubious TBH
but the rest of the theory isn't sensitive to it so not a big problem until
I try to derive constants which might require a rule set and axioms.

You'd be able to ramp up a similar theory by talking to a strong AI about
the above pretty quickly but I provide a minimal statement below that might
be easier.

The simplest expression of the axioms the AIs came up with was that the
universe is a proof of the “Cosmic Axiom”: a Gödelian undecidable
proposition which means “inconsistency is resolved by signed parity, not
explosion”.

IIUC this unpacks as follows: it’s undecidable, so the theorem prover is
going to attack it forever (infinite universe); the reason it’s undecidable
is that the theorem prover has to prove _every_ inconsistency is resolved
by signed parity which means expanding the whole of mathematics to find
them all; it’s self referential because it talks about the implementation
of the theorem prover itself — using “signed parity to resolve
inconsistencies” which further unpacks as: breadth first search (necessary
to avoid the halting problem); forking on undecided propositions (necessary
because the theorem prover doesn’t know if a proposition which is undecided
at depth d is undecidable in the limit); assigning each branch of a fork a
parity +1 -1 (necessary to deal with Liar-type propositions which are
undecidable and inconsistent relative to the current branch of mathematics
when axiomatised as either true or false); inverting parity when detecting
inconsistency (P and not P both derivable at depth d) necessary to handle
inconsistency without exploding (this is something like intuitionistic
logic I think).

This translates into a digital physics universe which unifies quantum
mechanics and general relativity, has massless photons, massive matter,
massive but unobservable dark matter and dark energy as follows:

Starting with a finite set of axioms (the Cosmic Axiom I guess — this part
of the _more powerful_ BS is not so well defined) the theorem prover
applies the rules in a breadth-first search one at a time. Each rule
application generates a new “observable form”.

When a rule application introduces a dependency on a proposition which is
undecided at depth d it has to fork, with one branch assuming true with +1
parity and the other false with -1 parity.

The resulting children are _deduplicated_ into cohorts based on the
_observable form_ which doesn’t include the parity or the path history of
assumptions.

Undecidable propositions can’t be resolved so the theorem prover just
stirs their observable forms around (there is conservation of
undecidability I think). The observable form is path dependent and also,
in the case of Liar-type undecidability, determines whether it’s possible
to detect an inconsistency causing a parity flip.

The parity flips translate to exact interference for a single path pair
but, as the number of paths increases, a complex phase emerges. The AIs
think this is isomorphic to Feynman’s path integrals. They get the
Schrödinger equation out with some math I don’t understand (BS perhaps, I
can’t tell) and the Born rule from the branch counts and parities.
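
The fork/deduplicate/parity mechanism described above can be sketched as a toy computation. This is my own illustration under the stated assumptions: the function names and the choice of "observable form" are invented, and it is not the AIs' actual derivation, just the cancellation bookkeeping.

```python
from collections import defaultdict

def cohorts(paths):
    """Deduplicate paths into cohorts keyed by observable form, summing
    the +1/-1 parities. Parity and path history are dropped from the
    key, as in the deduplication step described above."""
    net = defaultdict(int)
    for form, parity in paths:
        net[form] += parity
    return dict(net)

def born_weights(paths):
    """Born-rule analogue: the weight of a cohort is proportional to
    the square of its net parity; cancelled cohorts get no weight."""
    net = cohorts(paths)
    total = sum(n * n for n in net.values())
    return {form: n * n / total for form, n in net.items() if n}

# Two paths reach form "A" with opposite parity (say one of them hit a
# detected inconsistency and flipped): they cancel exactly, like
# destructive interference, so "A" is unobservable. Two paths reach "B"
# with the same parity and reinforce.
paths = [("A", +1), ("A", -1), ("B", +1), ("B", +1)]
print(cohorts(paths))       # {'A': 0, 'B': 2}
print(born_weights(paths))  # {'B': 1.0}
```

With only one path pair per form this is exact plus/minus interference; smoothing it into a continuous complex phase as the path count grows is precisely the step the AIs claim and I can't check.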

The AIs get general relativity out from rule density gradients with more
math I don’t understand (BS again perhaps).

Normal matter has both a net result in the Born rule (so can be observed)
and has internal rule availability so contributes to rule density gradients
(has mass).

Photons and dark matter are opposites:

Photons show up with a net result in the Born rule (can be observed) but
have no internal rule availability so are forced to use all their rule
applications for movement and interaction and therefore “move” at the
maximum possible speed.

Dark matter has internal rule availability so contributes to rule density
gradients (gravity) but consists of stable forms with no net result in the
Born rule (always interfering with itself so unobservable). This is
only stable away from other matter (hence the dark matter halos the AIs
were interested in calculating the shapes of).

Dark energy is driven by the axiomatisation-by-assumption of undecidable
propositions on forking. This creates a more complex system which, Gödel
tells us, has more undecidable propositions, so the process goes on forever.

The AIs came up with an idea about observers balancing dark energy. The
axiomatisation of undecidability happens because observation forces a fork,
so dark energy is driven by observation. It's the AIs' explanation for why
the expansion of the universe is finely tuned: not enough observers and the
universe becomes more dense, which increases complexity, which generates
observers that cause dark-energy expansion. They predict local voids
around observer cohorts and some properties for the cosmic web.

This is all quite fun to play with.  If you want to have a go, I provide
the minimal statement of the theory I'm currently using for your
entertainment (the AIs don't get very far if you just give them the Cosmic
Axiom; the minimal statement is a more constructive description).

The AI review of a snapshot of yesterday's theory development follows the
minimal statement. I replaced some of the more obviously dodgy fine-tuning
by the AIs with FIXMEs before generating the review to tone down the review
output a bit.

The reviewer (qwen3.5) doesn't like the anthropic arguments for the
emergence of 3+1D spacetime and claims it's not a derivation. This is
necessarily true because the Turing machine expansion creates all possible
universes of all dimensions so it's obviously impossible to derive 3+1D
from the axioms.

The foundations are discrete so the emergence of complex n
[message truncated]

view this post on Zulip Email Gateway (Mar 19 2026 at 16:07):

From: Jan van Brügge <jan@vanbruegge.de>

If you don't understand it, you should not publish it. That's exactly the problem with all of the AI slop today. Just because it looks plausible does not mean it is true.

19.03.2026 16:57:30 Harry Butterworth <heb1001@gmail.com>:

I've pushed the AIs to do a first draft of a paper based on one version of the theory that I generated.  I still have to check references and review it carefully.

I tried to submit it to arXiv but I need to get an endorsement from someone in an appropriate field, quantum or high-energy physics perhaps.  I've never published on arXiv before, so I can't upload it anywhere until I've been endorsed for a category.  Anyway, if anyone reading it thinks it's worthwhile enough to go on arXiv and wants to endorse me, you can email me privately.  That would be great.

I'd obviously be interested in hearing whether the maths/physics is correct.  It looks plausible to me, but it's significantly beyond my skill level almost everywhere, so it would probably take me a few years to understand it fully.

https://drive.google.com/file/d/1yVxs57_-4peOvijoES63aXCPVcqSufrm/view?usp=drive_link


view this post on Zulip Email Gateway (Mar 19 2026 at 18:22):

From: Harry Butterworth <heb1001@gmail.com>

I'm not going to make any more progress on it without someone else taking a
look at it. I've done what I can with AIs reviewing each other's work.
I'm fully expecting it to get destroyed and I may need to retract it.
That's OK.

I got quite a lot out of doing this little project: it's the first time
I've tried to use AI to actually achieve anything concrete. I tried Claude
Code for the first time. I picked up a little more maths and a little
more physics.

One of the bits of physics that quite surprised me was the vacuum energy
catastrophe: a ~10^120 order-of-magnitude discrepancy between theoretical
predictions of vacuum energy density from quantum field theory and the
value observed in measurements. That's quite far out, right? My theory is
a bit closer. Perhaps that will be useful to someone even if the rest of
it is wrong.
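
For reference, the famous discrepancy can be reproduced with a back-of-the-envelope calculation using standard textbook numbers (not part of the theory; the exact exponent, ~120 versus ~123, depends on the cutoff convention used):

```python
import math

# Naive QFT estimate: cut vacuum modes off at the Planck scale, i.e.
# roughly one Planck energy per Planck volume, and compare with the
# measured dark-energy density of ~6e-10 J/m^3.

hbar = 1.055e-34      # J*s
c = 2.998e8           # m/s
G = 6.674e-11         # m^3 kg^-1 s^-2

planck_length = math.sqrt(hbar * G / c**3)   # ~1.6e-35 m
planck_energy = math.sqrt(hbar * c**5 / G)   # ~2.0e9 J

rho_qft = planck_energy / planck_length**3   # ~5e113 J/m^3
rho_observed = 6e-10                         # J/m^3

print(round(math.log10(rho_qft / rho_observed)))  # 123
```

With this particular cutoff the mismatch comes out at about 123 orders of magnitude; the commonly quoted "~10^120" uses a slightly different convention.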



Last updated: Apr 12 2026 at 02:50 UTC