Stream: New Members & Projects

Topic: Annotated Isabelle scripts for pretraining


view this post on Zulip Qiyuan Xu (Jan 28 2026 at 11:45):

Using my Minilang pipeline, I annotated the whole Isabelle/HOL stdlib and AFP-2025-02-12, adding the proof goal information and others into the script, resulting in ~500 million tokens suitable for LLM pretraining. The data is available on https://huggingface.co/datasets/ANTPG/annotated-isabelle


Last updated: Feb 04 2026 at 02:22 UTC