[isabelle] How to parse theory files with Outer_Syntax.scala · Mirror: Isabelle Users Mailing List

I have a question about parsing theory files. I've searched the chat but i haven't found the answer i was looking for, hence my message.

I need to split theory files into specs and proofs: so definitions go to one file and lemmas with their proofs in another. Makarius pointed me towards Outer_Syntax.scalaand, with the help of Copilot, i have some (incomplete) code running on simple theories such as the Fibonacci example presented in the code generation doc.
I would just like to see if the approach is "right": i currently match on span.name and i wonder if i should match instead on span.kind.
I also wonder if i should include keywords such as "::", ":", ".", "(", ")", "[code]" in Outer_Syntax and, if there is a unique way to classify keywords (like mapping ":" to "thy_goal" when creating thy_syntax).

package isabelle
object Test_Thy_Parser {
  private def is_definition(span: Command_Span.Span): Boolean = span.name match {
    case "theory" | "imports" | "begin" | "end" | "definition" | "fun" | "datatype" | "::" => true
    case _ => false
  }
  private def is_proof(span: Command_Span.Span): Boolean = span.name match {
    case "theorem" | "lemma" | "qed" | "proof" | "by" | "(" | ")" | ":" | "[code]" => true
    case _ => false
  }
  private def write_spans(path: Path, spans: List[Command_Span.Span]): Unit = {
    val content = new StringBuilder()
    spans.zipWithIndex.foreach { case (span, index) =>
      content.append(span.content.map(_.source).mkString(" "))
      spans.lift(index + 1).foreach { nextSpan =>
        if (!Set("::", ":", "(", ")", "[code]", "by").contains(nextSpan.name)) {
          content.append("\n\n")
        } else if (nextSpan.name == "[code]") {
          content.append(" ")
        } else if (nextSpan.name == "by") {
          content.append("\n ")
        }
      }
    }
    File.write(path, content.toString)
  }
  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      println("Usage: isabelle scala -e 'split_isa_spec_proofs.scala' <theory_file.thy>")
      sys.exit(1)
    }
    val thy_file = Path.explode(args(0))
    if (!thy_file.is_file) {
      println(s"Error: File not found - ${thy_file}")
      sys.exit(1)
    }
    // Setup syntax with common theory keywords
    val thy_syntax = Outer_Syntax.empty +
      ("::", "thy_defn") +
      ("theory", "thy_begin") +
      ("imports", "thy_decl") +
      ("begin", "thy_begin") +
      ("fun", "thy_defn") +
      ("datatype", "thy_defn") +
      ("definition", "thy_defn") +
      (":", "thy_goal") +
      ("(", "thy_goal") +
      (")", "thy_goal") +
      ("lemma", "thy_goal") +
      ("[code]", "thy_goal") +
      ("theorem", "thy_goal") +
      ("proof", "proof") +
      ("by", "proof") +
      ("qed", "proof") +
      ("end", "thy_end")
    // Read and parse theory content
    val theory_content = File.read(thy_file)
    val spans = thy_syntax.parse_spans(theory_content)
    // Separate definitions and proofs
    val definitions = spans.filter(is_definition)
    val proofs = spans.filter(is_proof)
    // Generate output files
    val base_path = thy_file.expand.implode
    val def_path = Path.explode(base_path.replaceAll("\\.thy$", "_definitions.thy"))
    val proof_path = Path.explode(base_path.replaceAll("\\.thy$", "_proofs.thy"))
    write_spans(def_path, definitions)
    write_spans(proof_path, proofs)
}}

Email Gateway (Jan 31 2025 at 12:15):

I have posted the above on the zulip chat but i'm reposting here in case
someone in this audience has advice.

My question is about parsing theory files. I've searched the zulip chat but i
haven't found the answer i was looking for, hence my message.

I need to split theory files into specs and proofs: definitions go to one file
and lemmas with their proofs in another. Makarius pointed me towards
Outer_Syntax.scalaand, with the help of Copilot, i have some (incomplete)
code running on simple theories such as the Fibonacci example presented in the
code generation doc.

What is not clear to me is if there is a "right" way to construct the mapping
for the keywords to be added to Outer_Syntax.empty: apparently, i do not need
to include "imports" so i was wondering if there is a way to get rid of
enumerating all the tokens that might appear in theory files. I would also
like to see if my understanding is correct in that i can choose the mapping as
it best suits my needs. For instance, i classify "lemma" as having kind
"PRF_GOAL" just as a way to make it appear in Keyword.proof. Would this be
something not inline with the underlying philosophy of Isabelle/Scala?

package isabelle

object Test_Thy_Parser {

  private def is_definition(span: Command_Span.Span, keywords:
Keyword.Keywords): Boolean = {
    keywords.kinds.get(span.name) match {
      case Some(kind) => Keyword.theory_defn.contains(kind)
      case None => false
    }
  }

  private def is_proof(span: Command_Span.Span, keywords: Keyword.Keywords):
Boolean = {
    keywords.kinds.get(span.name) match {
      case Some(kind) => Keyword.proof.contains(kind)
      case None => false
    }
  }

  private def write_spans(path: Path, spans: List[Command_Span.Span]): Unit =
{
    val content = new StringBuilder()
    spans.zipWithIndex.foreach { case (span, index) =>
      content.append(span.content.map(_.source).mkString(" "))
      spans.lift(index + 1).foreach { nextSpan =>
        if (!Set("(", ")", "[code]", "by").contains(nextSpan.name)) {
          content.append("\n\n")
        } else if (nextSpan.name == "[code]") {
          content.append(" ")
        } else if (nextSpan.name == "by") {
          content.append("\n ")
        }
      }
    }
    File.write(path, content.toString)
  }

  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      println("Usage: isabelle scala -e 'split_isa_spec_proofs.scala'
<theory_file.thy>")
      sys.exit(1)
    }

    val thy_file = Path.explode(args(0))
    if (!thy_file.is_file) {
      println(s"Error: File not found - ${thy_file}")
      sys.exit(1)
    }

    type Keywords = List[(String, Keyword.Spec)]

    val thy_keywords : Keywords = List(("%", Keyword.Spec()),
      ("(", Keyword.Spec()),
      (")", Keyword.Spec()),
      (",", Keyword.Spec()),
      ("::", Keyword.Spec()),
      (":", Keyword.Spec()),
      (".", Keyword.Spec()),
      ("=", Keyword.Spec()),
      ("and", Keyword.Spec()),
      ("begin", Keyword.Spec(kind = Keyword.QUASI_COMMAND)),
      //("imports", Keyword.Spec(kind = Keyword.QUASI_COMMAND)),
      ("text", Keyword.Spec(kind = Keyword.DOCUMENT_BODY)),
      ("fun", Keyword.Spec(kind = Keyword.THY_DEFN, tags = List("fun"))),
      ("datatype", Keyword.Spec(kind = Keyword.THY_DEFN, tags =
List("datatype"))),
      ("definition", Keyword.Spec(kind = Keyword.THY_DEFN, tags =
List("definition"))),
      ("theory", Keyword.Spec(kind = Keyword.THY_DEFN, tags =
List("theory"))),
      ("proof", Keyword.Spec(kind = Keyword.PRF_GOAL, tags = List("proof"))),
      ("lemma", Keyword.Spec(kind = Keyword.PRF_GOAL, tags = List("lemma"))),
      ("qed", Keyword.Spec(kind = Keyword.QED, tags = List("qed"))),
      ("by", Keyword.Spec(kind = Keyword.PRF_ASM_GOAL, tags = List("by"))),
      ("next", Keyword.Spec(kind = Keyword.NEXT_BLOCK, tags = List("next"))),
      ("end", Keyword.Spec(kind = Keyword.THY_DEFN, tags = List("end")))
      )

    // Setup syntax with common theory keywords
    val thy_syntax = Outer_Syntax.empty.add_keywords(thy_keywords)

    // Read and parse theory content
    val theory_content = File.read(thy_file)
    val spans = thy_syntax.parse_spans(theory_content)


    print(thy_syntax.keywords)
    // Separate definitions and proofs
    val definitions = spans.filter(span => is_definition(span,
thy_syntax.keywords))
    val proofs = spans.filter(span => is_proof(span, thy_syntax.keywords))

    // Generate output files
    val base_path = thy_file.expand.implode
    val def_path = Path.explode(base_path.replaceAll("\\.thy$",
"_definitions.thy"))
    val proof_path = Path.explode(base_path.replaceAll("\\.thy$",
"_proofs.thy"))

    write_spans(def_path, definitions)
    write_spans(proof_path, proofs)

    // Report results
    println(s"\nProcessed theory file: ${thy_file}")
    println("Generated files:")
    println(s"  Definitions: ${def_path}")
    println(s"  Proofs: ${proof_path}")
    println(s"\nFound ${definitions.length} definitions and ${proofs.length}
proofs")
  }
}

Email Gateway (Jan 31 2025 at 14:03):

Keywords and their kinds are specified by the theories themselves (the
'keywords' specification in the beginning, cf.
https://isabelle.in.tum.de/doc/isar-ref.pdf#keyword.keywords); to parse
a theory file, you have to read all theory headers (in the imports
graph) first and collect the outer syntax.

This would be like interpreting verbs as nouns as you are reading a
text. I doubt that this is what you want to achieve, but you could do it.

Email Gateway (Feb 01 2025 at 09:15):

For the moment, the code i have achieves what i want in that it separates
definitions from lemmas and their proofs. I understand from your comparison
that "lemma" shouldn't have kind "PRF_GOAL". What was somewhat implicit in
my question was: where could one find the "right" kinds for keywords, as i
did not find this information in the docs, while in the code the only
relevant piece i found was src/Pure/Thy/thy_header.scala. I see now that
i can find examples of associatiations (keyword, kind) in what is generated
by isabelle dump. For instance, i see in the markup file for a toy
theory: accepted command_span name=lemma kind=thy_goal_stmt.

If i do a pdfgrep -r "thy_goal_stmt" doc/ i find no results, so i
conclude that to get this information about the "right" mapping i just have
to extract it from the output of dump.

Stream: Mirror: Isabelle Users Mailing List

Topic: [isabelle] How to parse theory files with Outer_Syntax.scala

Lacramioara Astefanoaei (Jan 29 2025 at 07:41):

Email Gateway (Jan 31 2025 at 12:15):

Email Gateway (Jan 31 2025 at 14:03):

Email Gateway (Feb 01 2025 at 09:15):

Email Gateway (Feb 03 2025 at 13:38):