Using my Minilang pipeline, I annotated the whole Isabelle/HOL stdlib and AFP-2025-02-12, adding the proof goal information and others into the script, resulting in ~500 million tokens suitable for LLM pretraining. The data is available on https://huggingface.co/datasets/ANTPG/annotated-isabelle
Last updated: Feb 04 2026 at 02:22 UTC